Machine learning lecture slides
COMS 4771 Fall 2020
Classification II: Margins and SVMs
Outline
I Perceptron
I Margins
I Support vector machines
I Soft-margin SVM
Perceptron (1)
I Perceptron: a variant of SGD
I Uses hinge loss: $\ell_{\text{hinge}}(s) := \max\{0, 1 - s\}$
I Uses conservative updates: only update when there is a classification mistake
I Step size η = 1
I Continues updating until all training examples correctly
classified by current linear classifier
Figure 1: Comparing hinge loss and zero-one loss
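As a quick companion to Figure 1 (an illustration added here, not from the slides), a minimal NumPy sketch of the two losses; the function names are invented:

    import numpy as np

    def zero_one_loss(s):
        # 1 if the example is misclassified (signed score s <= 0), else 0
        return (s <= 0).astype(float)

    def hinge_loss(s):
        # hinge(s) = max{0, 1 - s}: a convex upper bound on the zero-one loss
        return np.maximum(0.0, 1.0 - s)

    s = np.linspace(-1.5, 1.5, 13)
    assert np.all(hinge_loss(s) >= zero_one_loss(s))   # hinge upper-bounds zero-one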
Perceptron (2)
I Start with $w^{(0)} = 0$.
I For $t = 1, 2, \dots$ until all training examples correctly classified by current linear classifier:
I Pick a training example—call it $(x_t, y_t)$—misclassified by $w^{(t-1)}$.
I Update:
\[ w^{(t)} := w^{(t-1)} - \nabla \ell_{\text{hinge}}(y_t x_t^\top w^{(t-1)}). \]
Perceptron (3)
I Note that whenever $y_t x_t^\top w^{(t-1)} \le 0$,
\[ \nabla \ell_{\text{hinge}}(y_t x_t^\top w^{(t-1)}) = \ell_{\text{hinge}}'(y_t x_t^\top w^{(t-1)}) \cdot y_t x_t = -1 \cdot y_t x_t. \]
I So update is
\[ w^{(t)} := w^{(t-1)} + y_t x_t. \]
I Final solution is of the form
\[ \hat{w} = \sum_{i \in S} y_i x_i \]
for some multiset $S$ of $\{1, \dots, n\}$.
I Possible to include same example index multiple times in S
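A minimal NumPy sketch of the Perceptron loop described above, assuming X is an n-by-d array of feature vectors and y an array of ±1 labels; the max_passes cap is an addition of this sketch (the slides' loop simply runs until no mistakes remain, which is guaranteed to happen only for linearly separable data):

    import numpy as np

    def perceptron(X, y, max_passes=1000):
        n, d = X.shape
        w = np.zeros(d)                              # w^(0) = 0
        for _ in range(max_passes):
            mistakes = np.flatnonzero(y * (X @ w) <= 0)
            if mistakes.size == 0:
                return w                             # every training example correctly classified
            t = mistakes[0]                          # any misclassified example (x_t, y_t)
            w = w + y[t] * X[t]                      # conservative update with step size 1
        return w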
Properties of Perceptron
I Suppose $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$ is linearly separable.
I Does Perceptron find a linear separator? (Yes.) How quickly?
I Depends on margin achievable on the data set—how much
wiggle room there is for linear separators.
Figure 2: Linearly separable data
Margins (1)
I Margin achieved by $w$ on $i$-th training example is the distance from $y_i x_i$ to decision boundary:
\[ \gamma_i(w) := \frac{y_i x_i^\top w}{\|w\|_2}. \]
I Maximum margin achievable on all training examples:
\[ \gamma^\star := \max_{w \in \mathbb{R}^d} \min_{i} \gamma_i(w). \]
I Theorem: If training data is linearly separable, Perceptron finds a linear separator after making at most $(L/\gamma^\star)^2$ updates, where $L = \max_i \|x_i\|_2$.
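A small NumPy sketch of these definitions (function names invented here); computing $\gamma^\star$ itself requires maximizing over $w$, which is what the SVM formulation on a later slide does:

    import numpy as np

    def margins(X, y, w):
        # gamma_i(w) = y_i x_i^T w / ||w||_2 for each training example
        return y * (X @ w) / np.linalg.norm(w)

    def margin(X, y, w):
        # margin achieved by w on the whole training set: min_i gamma_i(w)
        return margins(X, y, w).min()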
Figure 3: Margins
Margins (2)
I Let $w$ be a linear separator:
\[ y_i x_i^\top w > 0, \quad i = 1, \dots, n. \]
I Note: Scaling of $w$ does not change margin achieved on $i$-th example
\[ \gamma_i(w) = \frac{y_i x_i^\top w}{\|w\|_2}. \]
I WLOG assume $y_1 x_1^\top w = \min_i y_i x_i^\top w$.
I So $x_1$ is closest to decision boundary among all training examples.
I Rescale $w$ so that $y_1 x_1^\top w = 1$.
I Distance from $y_1 x_1$ to decision boundary is $1/\|w\|_2$.
I The shortest $w$ satisfying
\[ y_i x_i^\top w \ge 1, \quad i = 1, \dots, n \]
gives the linear separator with the maximum margin on all training examples.
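The argument above can be restated in one line (assuming the data are linearly separable, so all quantities below are positive):
\[
  \gamma^\star
  = \max_{w \in \mathbb{R}^d} \min_{i} \frac{y_i x_i^\top w}{\|w\|_2}
  = \max_{w \,:\, \min_i y_i x_i^\top w = 1} \frac{1}{\|w\|_2}
  = \frac{1}{\min\{\, \|w\|_2 : y_i x_i^\top w \ge 1,\ i = 1, \dots, n \,\}}.
\]
So maximizing the margin over all linear separators is the same as finding the shortest $w$ satisfying all $n$ constraints, which is exactly the optimization problem on the next slide.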
Support vector machine
I Weight vector of maximum margin linear separator: defined as solution to optimization problem
\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{2} \|w\|_2^2 \quad \text{subject to} \quad y_i x_i^\top w \ge 1, \; i = 1, \dots, n. \]
(The 1/2 prefactor is customary but inconsequential.)
I This is the support vector machine (SVM) optimization
problem.
I Feasible when data are linearly separable.
I Note: Preference for the weight vector achieving the maximum
margin is another example of inductive bias.
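As an illustration (not from the slides), the problem can be handed to an off-the-shelf solver; the sketch below uses scikit-learn's SVC with a linear kernel, where a very large C approximates the hard-margin SVM. Note that scikit-learn also fits an intercept term, which the slides' homogeneous formulation omits; the toy data set is invented here:

    import numpy as np
    from sklearn.svm import SVC

    # Tiny linearly separable toy data set with labels in {-1, +1}.
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])

    # A very large C approximates the hard-margin SVM.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w, b = clf.coef_.ravel(), clf.intercept_[0]
    print("constraints satisfied:", np.all(y * (X @ w + b) >= 1 - 1e-6))
    print("margin = 1/||w||_2 =", 1.0 / np.linalg.norm(w))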
Support vectors
I Just like least norm solution to normal equations (and ridge regression), solution $w$ to SVM problem can be written as $\sum_{i=1}^{n} \alpha_i y_i x_i$ for some $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ (in fact, $\alpha_i \ge 0$)
I (Adding $r \in \mathbb{R}^d$ orthogonal to span of $x_i$'s to weight vector can only increase the length without changing the constraint values.)
I The examples $(x_i, y_i)$ for which $\alpha_i \ne 0$ are called support vector examples: they have $y_i x_i^\top w = 1$ and are closest to decision boundary.
Figure 4: Support vectors
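In scikit-learn terms (a self-contained continuation of the earlier sketch), the fitted model exposes exactly this representation: dual_coef_ holds the nonzero values $y_i \alpha_i$ and support_vectors_ the corresponding $x_i$, so the weight vector can be reassembled as their weighted sum:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print("indices of support vector examples:", clf.support_)
    # dual_coef_[0, j] stores y_i * alpha_i for the j-th support vector
    w = clf.dual_coef_[0] @ clf.support_vectors_
    print("w = sum_i alpha_i y_i x_i matches clf.coef_:",
          np.allclose(w, clf.coef_.ravel()))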
Soft-margin SVM (1)
I What if not linearly separable? SVM problem has no solution.
I Introduce slack variables for constraints, and $C \ge 0$:
\[ \min_{w \in \mathbb{R}^d, \; \xi_1, \dots, \xi_n \ge 0} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i x_i^\top w \ge 1 - \xi_i, \; i = 1, \dots, n. \]
I This is the soft-margin SVM optimization problem.
I A constrained convex optimization problem
I For given $w$, $\xi_i / \|w\|_2$ is distance that $x_i$ has to move to satisfy $y_i x_i^\top w \ge 1$.
Figure 5: Support vectors in soft-margin SVM
Soft-margin SVM (2)
I Equivalent unconstrained form:
\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \max\{0, 1 - y_i x_i^\top w\}. \]
I Rewriting using $\lambda = 1/(nC)$ and $\ell_{\text{hinge}}$:
\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{hinge}}(y_i x_i^\top w) + \frac{\lambda}{2} \|w\|_2^2. \]
I Same template as ridge regression, Lasso, . . . !
I Data fitting term (using a surrogate loss function)
I Regularizer that promotes inductive bias
I λ controls trade-off of concerns
I Both SVM and soft-margin SVM can be kernelized
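To make the "same template" point concrete, here is a rough sketch (invented here, not from the slides) that minimizes the last objective directly by subgradient descent with a fixed step size; in practice one would use a dedicated solver such as sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier(loss="hinge"):

    import numpy as np

    def soft_margin_svm(X, y, lam=0.1, steps=1000, eta=0.01):
        """Minimize (1/n) sum_i hinge(y_i x_i^T w) + (lam/2) ||w||_2^2
        by subgradient descent with fixed step size eta."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(steps):
            active = y * (X @ w) < 1                 # examples with nonzero hinge loss
            # subgradient: -(1/n) sum over active examples of y_i x_i, plus lam * w
            grad = -(y[active, None] * X[active]).sum(axis=0) / n + lam * w
            w = w - eta * grad
        return w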