
CS373 Data Mining and
Machine Learning

Lecture 2
Jean Honorio

Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

Today’s topics
• Perceptron, convergence

-  the prediction game
- mistakes, margin, and generalization

• Maximum margin classifier — support vector machine
- estimation, properties
- allowing misclassified points

Recall: linear classifiers
• A linear classifier (through origin) with parameters θ

divides the space into positive and negative halves

[Figure: the decision boundary θ · x = 0 separates the + and – halves; the discriminant function is θ · x]

• A sequence of examples and labels (x1, y1), (x2, y2), …, with labels yt ∈ {−1, +1}

• The perceptron algorithm applied to the sequence

• We would like to bound the number of mistakes that the
algorithm makes

The perceptron algorithm
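The algorithm body did not survive the slide export. Below is a minimal sketch (Python is an illustrative choice, not the lecture's) of the standard update that the mistake-bound proof later in the deck assumes: start from θ = 0 and, whenever an example is misclassified, add yi xi to θ.

```python
# Minimal perceptron through the origin (a sketch).
# theta starts at 0 and is corrected by y_i * x_i whenever example i is
# misclassified, which is exactly the update the mistake-bound proof uses.
import numpy as np

def perceptron(X, y, passes=10):
    """X: (n, d) array of examples; y: (n,) array of labels in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(passes):                 # sweep the sequence repeatedly
        for xi, yi in zip(X, y):
            if yi * (theta @ xi) <= 0:      # mistake (or exactly on the boundary)
                theta += yi * xi            # perceptron update
    return theta
```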

Mistakes and margin

Easy problem
– large margin
– few mistakes

Harder problem
– small margin
– many mistakes

[Figure: two scatter plots of + and – examples with a separating boundary: the easy problem has a large margin, the harder problem a small margin]

The target classifier
• We can quantify how hard the problem is by assuming

that there exists a target classifier that achieves a
certain margin

• The geometric margin is the distance from the separating
boundary to the closest example

• Our “target” classifier is one that achieves the largest
geometric margin (max-margin classifier)

[Figure: + and – examples separated by the target classifier's boundary; the geometric margin is the distance from the closest point to the separating boundary]

Perceptron mistake guarantee
• If the sequence of examples and labels is such that there
exists a target classifier θ* with geometric margin γg > 0,
and every example satisfies ||xt|| ≤ R,

then the perceptron algorithm makes at most (R / γg)² mistakes along the (infinite) sequence!

• Key points
-  large geometric margin relative to the norm of the examples

implies few mistakes

- the result does not depend on the dimension of the
examples (the number of parameters)

Mistake guarantee: proof
• We show that after k updates (mistakes), θ(k) · θ* ≥ k γg ||θ*||

•  Let the kth mistake be on the ith example, so the update is θ(k) = θ(k−1) + yi xi

Note:
Since θ(0) = 0, it follows that θ(k) · θ* ≥ k γg ||θ*||
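Filling in the step the slide compresses (a reconstruction of the standard argument; by the margin assumption, yi (xi · θ*) ≥ γg ||θ*|| for every example in the sequence):

```latex
\theta^{(k)} \cdot \theta^{*}
  = \bigl(\theta^{(k-1)} + y_i x_i\bigr) \cdot \theta^{*}
  = \theta^{(k-1)} \cdot \theta^{*} + y_i \,(x_i \cdot \theta^{*})
  \;\ge\; \theta^{(k-1)} \cdot \theta^{*} + \gamma_g \,\lVert \theta^{*} \rVert
```

Applying this once per mistake, starting from θ(0) = 0, gives the inequality in the note above.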

Mistake guarantee: proof
• We show that after k updates (mistakes), ||θ(k)||² ≤ k R²

•  Let the kth mistake be on the ith example, so the update is θ(k) = θ(k−1) + yi xi

Note:
Since θ(0) = 0 and ||xi|| ≤ R, it follows that ||θ(k)||² ≤ k R²
(the update was triggered by a mistake, so yi (θ(k−1) · xi) ≤ 0)
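Filling in the corresponding expansion (again a reconstruction; the cross term is ≤ 0 exactly because the kth update was a mistake, and ||xi||² ≤ R²):

```latex
\lVert \theta^{(k)} \rVert^{2}
  = \lVert \theta^{(k-1)} + y_i x_i \rVert^{2}
  = \lVert \theta^{(k-1)} \rVert^{2} + 2\, y_i \,\bigl(\theta^{(k-1)} \cdot x_i\bigr) + \lVert x_i \rVert^{2}
  \;\le\; \lVert \theta^{(k-1)} \rVert^{2} + R^{2}
```

Summing over the k mistakes, starting from θ(0) = 0, gives ||θ(k)||² ≤ k R².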

Mistake guarantee: proof
• We have shown that after k updates (mistakes),
θ(k) · θ* ≥ k γg ||θ*||   and   ||θ(k)||² ≤ k R²

• As a result, the cosine of the angle between θ(k) and θ* satisfies
cos(θ(k), θ*) = θ(k) · θ* / (||θ(k)|| ||θ*||) ≥ k γg ||θ*|| / (√k R ||θ*||) = √k γg / R
Since the cosine is at most 1, we get k ≤ (R / γg)²
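A quick numerical sanity check of the bound (a sketch; the target classifier θ*, the enforced margin γ, and the synthetic data are illustrative assumptions, not the lecture's example):

```python
# Run the perceptron on separable synthetic data and compare the number of
# mistakes to the bound (R / gamma)^2, where gamma lower-bounds the
# geometric margin of theta_star and R bounds the norm of the examples.
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit-norm target classifier
gamma = 0.2                                        # enforced margin w.r.t. theta_star

X = rng.uniform(-1.0, 1.0, size=(500, 2))
X = X[np.abs(X @ theta_star) >= gamma]             # keep points at distance >= gamma
y = np.sign(X @ theta_star)                        # labels given by the target classifier
R = np.max(np.linalg.norm(X, axis=1))

theta = np.zeros(2)
mistakes = 0
changed = True
while changed:                                     # cycle until a full pass with no mistakes
    changed = False
    for xi, yi in zip(X, y):
        if yi * (theta @ xi) <= 0:                 # mistake: perceptron update
            theta += yi * xi
            mistakes += 1
            changed = True

print("mistakes:", mistakes, "bound:", (R / gamma) ** 2)
```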

Summary (perceptron)
• By analyzing the simple perceptron algorithm, we were

able to relate the number of mistakes, geometric
margin, and generalization

• The perceptron algorithm converges to a classifier close
to the max-margin target classifier

• In cases where we are given a fixed set of training
examples, and they are linearly separable, we can find
and use the maximum margin classifier directly

Maximum margin classifier

[Figure: + and – examples with the maximum margin boundary; the geometric margin is the distance from the boundary to the closest points on either side]


Support vector machine

• This is a quadratic programming problem (quadratic
objective, linear constraints)

• The solution is unique, typically obtained in the dual

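The optimization problem itself is not visible in the export; the standard hard-margin primal (through the origin) that matches the "quadratic objective, linear constraints" description on this slide is:

```latex
\min_{\theta} \ \tfrac{1}{2}\lVert \theta \rVert^{2}
\qquad \text{subject to} \qquad
y_i \,(\theta \cdot x_i) \ \ge\ 1, \quad i = 1, \dots, n
```

At the optimum the geometric margin equals 1 / ||θ||, so minimizing ||θ||² maximizes the margin.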


Support vector machine

[Figure: + and – examples with the max-margin boundary; the margin constraints that are active (the points lying exactly on the margin) are the support vectors]

active constraints = support vectors

The solution is sparse
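A quick way to see the sparsity in practice (a sketch; scikit-learn and the synthetic data are illustrative choices, not the lecture's tools; a hard margin is approximated with a large C):

```python
# Only a small subset of the training points end up as support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),    # positive cloud
               rng.normal(-2.0, 0.5, size=(50, 2))])   # negative cloud
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)             # large C ~ hard margin
print("support vectors:", len(clf.support_), "out of", len(X))
```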

Is sparse solution good?

• We can simulate test performance by evaluating Leave-
One-Out Cross-Validation error

[Figure: the same max-margin boundary, with one support vector highlighted]

Intuitively:
if you remove the support vector from the training set, and you receive the
support vector as a test point, then you would make a mistake
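This intuition is usually summarized as a bound (a standard result, stated here for reference rather than taken from the slide): removing a non-support vector leaves the max-margin boundary unchanged, so only support vectors can contribute leave-one-out mistakes, and therefore

```latex
\text{LOOCV error} \;\le\; \frac{\#\,\text{support vectors}}{n}
```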

Linear classifiers (with offset)
• A linear classifier with parameters (θ, θ0) classifies a point x
according to the sign of the discriminant function θ · x + θ0

[Figure: the decision boundary θ · x + θ0 = 0 no longer passes through the origin]

Support vector machine

• Still a quadratic programming problem (quadratic
objective, linear constraints)

[Figure: + and – examples with the max-margin boundary and offset; the geometric margin is shown on both sides]
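For reference, the primal with the offset included (a reconstruction of the standard form; the offset θ0 appears in the constraints but not in the objective):

```latex
\min_{\theta,\,\theta_0} \ \tfrac{1}{2}\lVert \theta \rVert^{2}
\qquad \text{subject to} \qquad
y_i \,(\theta \cdot x_i + \theta_0) \ \ge\ 1, \quad i = 1, \dots, n
```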

The impact of offset
• Adding the offset parameter to the linear classifier can

substantially increase the margin

Support vector machine
• Several desirable properties

- maximizes the margin on the training set (good generalization)

-  the solution is unique and sparse (good generalization)
• But…

-  the solution is sensitive to outliers and labeling errors, as they
may drastically change the resulting max-margin boundary

-  if the training set is not linearly separable, there’s no
solution!

Support vector machine
• Relaxed quadratic optimization problem

slack variables permit us to violate
some of the margin constraints

penalty for constraint violation
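The relaxed problem in its standard form (a reconstruction consistent with the annotations above; ξi are the slack variables and C is the penalty for violating the margin constraints):

```latex
\min_{\theta,\,\theta_0,\,\xi} \ \tfrac{1}{2}\lVert \theta \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \,(\theta \cdot x_i + \theta_0) \ \ge\ 1 - \xi_i, \quad \xi_i \ \ge\ 0, \quad i = 1, \dots, n
```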



Support vectors and slack
• The solution now has three types of support vectors
- constraint is tight but there's no slack
- non-zero slack but the point is not misclassified
- non-zero slack and the point is misclassified

Examples
• C=100

Examples
• C=10

Examples
• C=1

Examples
• C=0.1

Examples
• C potentially affects the solution even in the separable case

• C = 1

Examples
• C potentially affects the solution even in the separable case

• C = 0.1

Examples
• C potentially affects the solution even in the separable case

• C = 0.01
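A small way to reproduce the effect of C (a sketch; scikit-learn and the synthetic data are illustrative assumptions, not the lecture's figures): smaller C tolerates more slack and tends to give a larger margin, at the price of more margin violations.

```python
# Sweep the penalty C and report the resulting geometric margin and the
# number of support vectors of a linear soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (100, 10, 1, 0.1, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)        # geometric margin of the fitted boundary
    print(f"C={C:7}: margin={margin:.3f}, support vectors={len(clf.support_)}")
```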