CS373 Data Mining and
Machine Learning
Lecture 2
Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
Today’s topics
• Perceptron, convergence
- the prediction game
- mistakes, margin, and generalization
• Maximum margin classifier — support vector machine
- estimation, properties
- allowing misclassified points
Recall: linear classifiers
• A linear classifier (through origin) with parameters θ ∈ R^d,
f(x; θ) = sign(θ · x),
divides the space into positive and negative halves; the decision boundary is
the set {x : θ · x = 0} and θ · x is the discriminant function
• A sequence of examples and labels (x_1, y_1), (x_2, y_2), …, with labels y_t ∈ {−1, +1}
• The perceptron algorithm is applied to the sequence
• We would like to bound the number of mistakes that the
algorithm makes
The perceptron algorithm
• Start from θ(0) = 0
• Go through the examples in sequence; on example (x_i, y_i), predict sign(θ · x_i)
• On a mistake, i.e. y_i (θ · x_i) ≤ 0, update θ ← θ + y_i x_i; otherwise leave θ unchanged
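A minimal sketch of this update rule in Python (variable and function names are mine, not from the course materials):

import numpy as np

def perceptron(X, y, n_epochs=10):
    """Run the through-origin perceptron on examples X (n x d), labels y in {-1, +1}.
    Returns the parameter vector theta and the number of mistakes made."""
    theta = np.zeros(X.shape[1])          # theta(0) = 0
    mistakes = 0
    for _ in range(n_epochs):             # sweep over the sequence several times
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(theta, x_i) <= 0:   # mistake: wrong side or on the boundary
                theta = theta + y_i * x_i        # perceptron update
                mistakes += 1
    return theta, mistakes

# Example: two linearly separable points
X = np.array([[1.0, 2.0], [-2.0, -1.0]])
y = np.array([+1, -1])
theta, k = perceptron(X, y)
print(theta, k)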
Mistakes and margin
• Easy problem: large margin, few mistakes
• Harder problem: small margin, many mistakes
The target classifier
• We can quantify how hard the problem is by assuming
that there exists a target classifier that achieves a
certain margin
• The geometric margin is the closest distance to the
separating boundary
• Our “target” classifier is one that achieves the largest
geometric margin (max-margin classifier)
[figure: separable + and – training points and the separating boundary of the
target classifier; the geometric margin is the distance from the closest point
to the separating boundary]
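The definition itself did not survive extraction; the standard definition, consistent with the proof later in the lecture, is (for a through-origin classifier θ on training points (x_t, y_t)):

\[
\gamma_{\mathrm{geom}}(\theta) \;=\; \min_{t}\; \frac{y_t\,(\theta \cdot x_t)}{\lVert \theta \rVert},
\qquad
\theta^{*} \;=\; \arg\max_{\theta}\; \gamma_{\mathrm{geom}}(\theta),
\qquad
\gamma_g \;:=\; \gamma_{\mathrm{geom}}(\theta^{*}).
\]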
Perceptron mistake guarantee
• If the sequence of examples and labels is such that there
exists θ* with geometric margin γ_g > 0 and ||x_t|| ≤ R for all t,
then the perceptron algorithm makes at most (R / γ_g)²
mistakes along the (infinite) sequence!
• Key points
- a large geometric margin relative to the norm of the examples
implies few mistakes
- the result does not depend on the dimension of the
examples (the number of parameters)
Mistake guarantee: proof
• We show that after k updates (mistakes), θ(k) points in nearly the same direction as θ*
• Let the kth mistake be on the ith example, so y_i (θ(k−1) · x_i) ≤ 0 and the update is θ(k) = θ(k−1) + y_i x_i
• Progress in the direction of θ*:
θ(k) · θ* = θ(k−1) · θ* + y_i (x_i · θ*) ≥ θ(k−1) · θ* + γ_g ||θ*||   (margin)
Since θ(0) = 0, after k mistakes θ(k) · θ* ≥ k γ_g ||θ*||
• Growth of the norm:
||θ(k)||² = ||θ(k−1)||² + 2 y_i (θ(k−1) · x_i) + ||x_i||² ≤ ||θ(k−1)||² + R²   (the middle term is ≤ 0 because the kth example was a mistake)
Since θ(0) = 0, after k mistakes ||θ(k)||² ≤ k R²
• As a result, the cosine of the angle between θ(k) and θ*:
1 ≥ cos(θ(k), θ*) = (θ(k) · θ*) / (||θ(k)|| ||θ*||) ≥ (k γ_g ||θ*||) / (√(k R²) ||θ*||) = √k γ_g / R,
which gives k ≤ (R / γ_g)²
Summary (perceptron)
• By analyzing the simple perceptron algorithm, we were
able to relate the number of mistakes, geometric
margin, and generalization
• The perceptron algorithm converges to a classifier close
to the max-margin target classifier
• In cases where we are given a fixed set of training
examples, and they are linearly separable, we can find
and use the maximum margin classifier directly
Maximum margin classifier
[figure, built up over several slides: separable + and – training points, a
separating boundary, and its geometric margin; the maximum margin classifier
is the separating boundary whose geometric margin is largest]
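The optimization problem itself was lost in extraction; a standard through-origin formulation, consistent with the "quadratic objective, linear constraints" description on the next slide, is

\[
\min_{\theta}\;\tfrac{1}{2}\lVert\theta\rVert^{2}
\quad\text{subject to}\quad
y_i\,(\theta \cdot x_i)\;\ge\;1, \qquad i = 1,\dots,n,
\]

whose solution attains geometric margin 1 / ||θ||.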
Support vector machine
• This is a quadratic programming problem (quadratic
objective, linear constraints)
• The solution is unique, typically obtained in the dual
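The dual is not reproduced in the extracted slides; for the through-origin problem above it takes the standard form

\[
\max_{\alpha \ge 0}\;\sum_{i}\alpha_i \;-\; \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,(x_i \cdot x_j),
\]

with the primal solution recovered as θ = Σ_i α_i y_i x_i; training points with α_i > 0 are the support vectors.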
Support vector machine
[figure: the max-margin boundary; the distance from the boundary to the closest
training points equals the geometric margin]
• active constraints = support vectors
• the solution is sparse
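A concrete illustration of this sparsity (not from the slides; it uses scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin classifier):

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("theta =", clf.coef_[0], " theta_0 =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)   # typically only a few of the points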
Is a sparse solution good?
• We can estimate test performance by evaluating the Leave-One-Out
Cross-Validation (LOOCV) error
[figure: the max-margin boundary with one support vector highlighted]
• Intuitively: if you remove a support vector from the training set and then
receive that same point as a test point, you could make a mistake (the boundary
may change without it); removing a non-support vector leaves the boundary
unchanged, so no new mistake can occur
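This intuition gives the standard leave-one-out bound (not printed on the extracted slide):

\[
\text{LOOCV error} \;\le\; \frac{\#\,\text{support vectors}}{n},
\]

so a sparse solution (few support vectors) suggests good generalization.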
Linear classifiers (with offset)
• A linear classifier with parameters θ and offset θ0,
f(x; θ, θ0) = sign(θ · x + θ0),
has decision boundary {x : θ · x + θ0 = 0}
Support vector machine
• Still a quadratic programming problem (quadratic
objective, linear constraints)
[figure: the max-margin boundary with offset and its geometric margin]
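The with-offset problem is not reproduced in the extracted text; the standard form is

\[
\min_{\theta,\theta_0}\;\tfrac{1}{2}\lVert\theta\rVert^{2}
\quad\text{subject to}\quad
y_i\,(\theta \cdot x_i + \theta_0)\;\ge\;1, \qquad i = 1,\dots,n.
\]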
The impact of offset
• Adding the offset parameter to the linear classifier can
substantially increase the margin
Support vector machine
• Several desirable properties
- maximizes the margin on the training set (→ good generalization)
- the solution is unique and sparse (→ good generalization)
• But…
- the solution is sensitive to outliers and labeling errors, as they
may drastically change the resulting max-margin boundary
- if the training set is not linearly separable, there’s no solution!
Support vector machine
• Relaxed quadratic optimization problem (formulation sketched below)
- slack variables permit us to violate some of the margin constraints
- the objective adds a penalty for each constraint violation
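The relaxed formulation itself was lost in extraction; the standard soft-margin problem consistent with the annotations above ("slack variables", "penalty for constraint violation") is

\[
\min_{\theta,\theta_0,\xi}\;\tfrac{1}{2}\lVert\theta\rVert^{2} \;+\; C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(\theta \cdot x_i + \theta_0)\;\ge\;1-\xi_i,\qquad \xi_i\;\ge\;0,
\]

where C > 0 trades off a large margin against the total amount of margin violation.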
Support vectors and slack
• The solution now has three types of support vectors
- the constraint is tight but there is no slack (the point lies exactly on the margin)
- non-zero slack but the point is not misclassified (the point falls inside the margin, on the correct side of the boundary)
- non-zero slack and the point is misclassified (the point falls on the wrong side of the boundary)
Examples
• Solutions for C = 100, C = 10, C = 1, and C = 0.1 (one figure per value)
Examples
• C potentially affects the solution even in the separable case
• Solutions for C = 1, C = 0.1, and C = 0.01 (one figure per value)
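A sketch of how figures like these could be reproduced (not the course's code; it uses scikit-learn's soft-margin SVC on assumed synthetic data):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two separable Gaussian blobs (hypothetical data, for illustration only)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

for C in [100, 10, 1, 0.1, 0.01]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # distance from boundary to the margin hyperplanes
    print(f"C={C:>6}: margin={margin:.3f}, #support vectors={len(clf.support_vectors_)}")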