Machine learning lecture slides
COMS 4771 Fall 2020
Classification II: Margins and SVMs
Outline
Perceptron
Margins
Support vector machines
Soft-margin SVM
Perceptron (1)
Perceptron: a variant of SGD
Uses hinge loss: $\ell_{\mathrm{hinge}}(s) := \max\{0, 1 - s\}$
Uses conservative updates: only update when there is a classification mistake
Step size $\eta = 1$
Continues updating until all training examples are correctly classified by the current linear classifier
Figure 1: Comparing hinge loss and zero-one loss
Perceptron (2)
Start with $w^{(0)} = 0$.
For $t = 1, 2, \dots$ until all training examples are correctly classified by the current linear classifier:
Pick a training example, call it $(x_t, y_t)$, misclassified by $w^{(t-1)}$. Update:
$w^{(t)} := w^{(t-1)} - \nabla \ell_{\mathrm{hinge}}(y_t x_t^\top w^{(t-1)})$.
Perceptron (3)
Note that whenever $y_t x_t^\top w^{(t-1)} \le 0$,
$\nabla \ell_{\mathrm{hinge}}(y_t x_t^\top w^{(t-1)}) = \ell'_{\mathrm{hinge}}(y_t x_t^\top w^{(t-1)}) \cdot y_t x_t = -1 \cdot y_t x_t.$
So the update is
$w^{(t)} := w^{(t-1)} + y_t x_t.$
The final solution is of the form
$\hat{w} = \sum_{i \in S} y_i x_i$
for some multiset $S$ of $\{1, \dots, n\}$.
Possible to include same example index multiple times in S
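The slides describe Perceptron procedurally; below is a minimal NumPy sketch of that loop (the toy dataset, the `max_epochs` safety cap, and all names are illustrative assumptions, not from the slides). It also records the multiset $S$ of update indices, so the representation $\hat{w} = \sum_{i \in S} y_i x_i$ can be checked directly.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron with the update above: w <- w + y_t x_t on each mistake.

    X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    Returns the final weight vector and the multiset S of update indices.
    (max_epochs is a safety cap for non-separable data; for separable data
    the loop stops once every training example is correctly classified.)
    """
    n, d = X.shape
    w = np.zeros(d)                      # w^(0) = 0
    S = []                               # multiset of indices used in updates
    for _ in range(max_epochs):
        made_mistake = False
        for t in range(n):
            if y[t] * (X[t] @ w) <= 0:   # misclassified by current w
                w = w + y[t] * X[t]      # conservative update, step size 1
                S.append(t)
                made_mistake = True
        if not made_mistake:             # all examples correctly classified
            return w, S
    return w, S

# Tiny check on a toy separable dataset (illustrative values only):
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w_hat, S = perceptron(X, y)
assert np.all(y * (X @ w_hat) > 0)                      # a linear separator was found
assert np.allclose(w_hat, sum(y[i] * X[i] for i in S))  # w_hat = sum_{i in S} y_i x_i
```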
Properties of Perceptron
Suppose $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$ is linearly separable.
Does Perceptron find a linear separator? (Yes.) How quickly?
Depends on margin achievable on the data set—how much
wiggle room there is for linear separators.
Figure 2: Linearly separable data
Margins (1)
Margin achieved by $w$ on the $i$-th training example is the distance from $y_i x_i$ to the decision boundary:
$\gamma_i(w) := \frac{y_i x_i^\top w}{\|w\|_2}.$
Maximum margin achievable on all training examples:
$\gamma^\star := \max_{w \in \mathbb{R}^d} \min_i \gamma_i(w).$
Theorem: If training data is linearly separable, Perceptron finds a linear separator after making at most $(L/\gamma^\star)^2$ updates, where $L = \max_i \|x_i\|_2$.
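As a concrete illustration of these definitions (not from the slides; the function names and NumPy choice are assumptions), one can compute the per-example margins of a candidate $w$ and the resulting update bound in a few lines:

```python
import numpy as np

def margins(w, X, y):
    """Per-example margins gamma_i(w) = y_i x_i^T w / ||w||_2."""
    return y * (X @ w) / np.linalg.norm(w)

def perceptron_update_bound(gamma_star, X):
    """Update bound (L / gamma_star)^2, where L = max_i ||x_i||_2 and
    gamma_star is the maximum margin achievable on the training data."""
    L = np.max(np.linalg.norm(X, axis=1))
    return (L / gamma_star) ** 2
```

Here `margins(w, X, y).min()` is the margin that a particular $w$ achieves on the whole training set; computing $\gamma^\star$ itself requires the optimization problem introduced on the following slides.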
Figure 3: Margins
Margins (2)
Let $w$ be a linear separator:
$y_i x_i^\top w > 0$, $i = 1, \dots, n$.
Note: Scaling of $w$ does not change the margin achieved on the $i$-th example:
$\gamma_i(w) = \frac{y_i x_i^\top w}{\|w\|_2}.$
WLOG assume $y_1 x_1^\top w = \min_i y_i x_i^\top w$.
So $x_1$ is closest to the decision boundary among all training examples.
Rescale $w$ so that $y_1 x_1^\top w = 1$.
Distance from $y_1 x_1$ to the decision boundary is $1/\|w\|_2$.
The shortest $w$ satisfying
$y_i x_i^\top w \ge 1$, $i = 1, \dots, n$
gives the linear separator with the maximum margin on all training examples.
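Spelling out the last two bullets as one short calculation (added here only to make the rescaling argument explicit; it restates the bullets above):

If $y_i x_i^\top w \ge 1$ for all $i$, with equality at $i = 1$, then
$\min_i \gamma_i(w) = \min_i \frac{y_i x_i^\top w}{\|w\|_2} = \frac{y_1 x_1^\top w}{\|w\|_2} = \frac{1}{\|w\|_2},$
so among all $w$ satisfying the constraints, the shortest one achieves the largest possible margin.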
Support vector machine
Weight vector of maximum margin linear separator: defined as solution to optimization problem
$\min_{w \in \mathbb{R}^d} \ \frac{1}{2} \|w\|_2^2$
subject to $y_i x_i^\top w \ge 1$, $i = 1, \dots, n$.
(The 1/2 prefactor is customary but inconsequential.)
This is the support vector machine (SVM) optimization
problem.
Feasible when data are linearly separable.
Note: Preference for the weight vector achieving the maximum
margin is another example of inductive bias.
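The problem above is a convex quadratic program, so any off-the-shelf solver applies. Below is a hedged sketch using cvxpy; the library choice, function name, and data layout are assumptions, since the slides do not prescribe a solver.

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||_2^2  subject to  y_i x_i^T w >= 1 for all i.

    X: (n, d) NumPy array; y: (n,) array with entries in {-1, +1}.
    Assumes the data are linearly separable; otherwise the problem is
    infeasible and w.value is None.
    """
    n, d = X.shape
    w = cp.Variable(d)
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w) >= 1]   # elementwise y_i * (x_i^T w) >= 1
    cp.Problem(objective, constraints).solve()
    return w.value
```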
Support vectors
Just like the least norm solution to the normal equations (and ridge regression), the solution $w$ to the SVM problem can be written as $\sum_{i=1}^n \alpha_i y_i x_i$ for some $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ (in fact, $\alpha_i \ge 0$).
(Adding $r \in \mathbb{R}^d$ orthogonal to the span of the $x_i$'s to the weight vector can only increase the length without changing the constraint values.)
The examples $(x_i, y_i)$ for which $\alpha_i \ne 0$ are called support vector examples: they have $y_i x_i^\top w = 1$ and are closest to the decision boundary.
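To see support vectors in practice, here is an illustrative scikit-learn snippet (an assumption, not course-provided code): a linear kernel with a very large C approximates the hard-margin problem, and the fitted model exposes the support vector examples and the coefficients $\alpha_i y_i$.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard-margin SVM
clf.fit(X, y)

print(clf.support_)                 # indices i of the support vector examples
print(clf.support_vectors_)         # the corresponding x_i (those with alpha_i != 0)
print(clf.dual_coef_)               # the coefficients alpha_i * y_i for those examples
```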
Soft-margin SVM (1)
What if the data are not linearly separable? The SVM problem has no solution. Introduce slack variables for the constraints, and $C \ge 0$:
$\min_{w \in \mathbb{R}^d,\ \xi_1, \dots, \xi_n \ge 0} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \xi_i$
subject to $y_i x_i^\top w \ge 1 - \xi_i$, $i = 1, \dots, n$.
This is the soft-margin SVM optimization problem.
A constrained convex optimization problem
For a given $w$, $\xi_i / \|w\|_2$ is the distance that $x_i$ has to move to satisfy $y_i x_i^\top w \ge 1$.
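The slack formulation is again a convex quadratic program; the earlier cvxpy sketch extends directly (still an illustrative assumption rather than code from the course).

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve min (1/2)||w||_2^2 + C * sum_i xi_i
    subject to y_i x_i^T w >= 1 - xi_i and xi_i >= 0."""
    n, d = X.shape
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)          # slack variables
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, xi.value
```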
Soft-margin SVM (2)
Equivalent unconstrained form:
$\min_{w \in \mathbb{R}^d} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \max\{0, 1 - y_i x_i^\top w\}.$
Rewriting using $\lambda = 1/(nC)$ and $\ell_{\mathrm{hinge}}$:
$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell_{\mathrm{hinge}}(y_i x_i^\top w) + \frac{\lambda}{2} \|w\|_2^2.$
Same template as ridge regression, Lasso, ...!
Data fitting term (using a surrogate loss function)
Regularizer that promotes inductive bias
$\lambda$ controls trade-off of concerns
Both SVM and soft-margin SVM can be kernelized
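Since the last objective is just an average surrogate loss plus a regularizer, it can also be minimized directly by (sub)gradient descent, in the same spirit as the SGD view of Perceptron. A minimal NumPy sketch follows; the step size, iteration count, and $\lambda$ are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def soft_margin_svm_subgrad(X, y, lam=0.1, step=0.01, iters=2000):
    """(Sub)gradient descent on (1/n) sum_i hinge(y_i x_i^T w) + (lam/2) ||w||_2^2.

    X: (n, d) array; y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        scores = y * (X @ w)
        active = scores < 1          # examples where the hinge loss is nonzero
        # subgradient of the averaged hinge losses, plus gradient of the regularizer
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + lam * w
        w -= step * grad
    return w
```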