Machine learning lecture slides

COMS 4771 Fall 2020

Classification II: Margins and SVMs

Outline

I Perceptron
I Margins
I Support vector machines
I Soft-margin SVM


Perceptron (1)

I Perceptron: a variant of SGD
I Uses hinge loss: ℓ_hinge(s) := max{0, 1 − s}
I Uses conservative updates: only update when there is a classification mistake
I Step size η = 1
I Continues updating until all training examples are correctly classified by the current linear classifier
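
The hinge loss and the zero-one loss compared in Figure 1 are simple to write down directly; here is a minimal sketch in Python (NumPy assumed), included only to make the two definitions concrete.

    import numpy as np

    def hinge_loss(s):
        # ℓ_hinge(s) = max{0, 1 - s}, a convex surrogate for the zero-one loss
        return np.maximum(0.0, 1.0 - s)

    def zero_one_loss(s):
        # 1 if the signed score s = y * x^T w is non-positive, else 0
        return (s <= 0).astype(float)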

Figure 1: Comparing hinge loss and zero-one loss


Perceptron (2)

I Start with w^{(0)} = 0.
I For t = 1, 2, . . . until all training examples correctly classified by current linear classifier:
I Pick a training example—call it (x_t, y_t)—misclassified by w^{(t−1)}.
I Update:

      w^{(t)} := w^{(t−1)} − ∇ℓ_hinge(y_t x_t^T w^{(t−1)}).
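
A minimal sketch of this loop in Python (NumPy assumed); the update matches the rule above, and the simple "scan until no mistakes" termination used here is one straightforward choice rather than the only possible implementation.

    import numpy as np

    def perceptron(X, y, max_passes=1000):
        """Perceptron. X is an (n, d) array; y has entries in {-1, +1}."""
        n, d = X.shape
        w = np.zeros(d)                       # w^{(0)} = 0
        for _ in range(max_passes):
            made_mistake = False
            for i in range(n):
                if y[i] * X[i].dot(w) <= 0:   # (x_i, y_i) misclassified by current w
                    w = w + y[i] * X[i]       # w^{(t)} = w^{(t-1)} + y_i x_i
                    made_mistake = True
            if not made_mistake:              # all training examples correctly classified
                return w
        return w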


Perceptron (3)

I Note that whenever y_t x_t^T w^{(t−1)} ≤ 0,

      ∇ℓ_hinge(y_t x_t^T w^{(t−1)}) = ℓ′_hinge(y_t x_t^T w^{(t−1)}) · y_t x_t = −1 · y_t x_t.

I So the update is

      w^{(t)} := w^{(t−1)} + y_t x_t.

I Final solution is of the form

      ŵ = ∑_{i ∈ S} y_i x_i

  for some multiset S of indices from {1, . . . , n}.
I Possible to include the same example index multiple times in S
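
This representation is easy to track during training. A small sketch (NumPy assumed; a hypothetical variant of the perceptron function above) counts how many times each example is used in an update and reconstructs ŵ = ∑_{i∈S} y_i x_i from those counts.

    import numpy as np

    def perceptron_counts(X, y, max_updates=1000):
        """Track alpha_i = number of updates made on example i."""
        n, d = X.shape
        alpha = np.zeros(n)
        for _ in range(max_updates):
            w = (alpha * y) @ X               # current w = sum_i alpha_i y_i x_i
            mistakes = np.where(y * (X @ w) <= 0)[0]
            if mistakes.size == 0:            # all examples correctly classified
                break
            alpha[mistakes[0]] += 1           # one conservative update per step
        return alpha, (alpha * y) @ X         # counts and the reconstructed w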


Properties of Perceptron

I Suppose (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × {−1, +1} is linearly separable.
I Does Perceptron find a linear separator? (Yes.) How quickly?
I Depends on the margin achievable on the data set—how much wiggle room there is for linear separators.

Figure 2: Linearly separable data


Margins (1)

I Margin achieved by w on the i-th training example is the distance from y_i x_i to the decision boundary:

      γ_i(w) := (y_i x_i^T w) / ‖w‖_2.

I Maximum margin achievable on all training examples:

      γ_⋆ := max_{w ∈ R^d} min_i γ_i(w).

I Theorem: If training data is linearly separable, Perceptron finds a linear separator after making at most (L/γ_⋆)^2 updates, where L = max_i ‖x_i‖_2.
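
These quantities are straightforward to compute for a given weight vector; a small sketch in Python (NumPy assumed) follows. Note that for any particular separator w we have γ_⋆ ≥ min_i γ_i(w), so (L / min_i γ_i(w))^2 is also a valid (possibly looser) upper bound on the number of Perceptron updates.

    import numpy as np

    def margins(X, y, w):
        # gamma_i(w) = y_i x_i^T w / ||w||_2 for every training example
        return y * (X @ w) / np.linalg.norm(w)

    def perceptron_update_bound(X, y, w):
        # (L / gamma)^2 with gamma = min_i gamma_i(w) and L = max_i ||x_i||_2;
        # only meaningful when w separates the data (gamma > 0)
        gamma = margins(X, y, w).min()
        L = np.linalg.norm(X, axis=1).max()
        return (L / gamma) ** 2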


Figure 3: Margins (a separator w with decision boundary H; the distance from y_1 x_1 to H is y_1 x_1^T w / ‖w‖_2)


Margins (2)

I Let w be a linear separator: y_i x_i^T w > 0, i = 1, . . . , n.
I Note: Scaling of w does not change the margin achieved on the i-th example:

      γ_i(w) = (y_i x_i^T w) / ‖w‖_2.

I WLOG assume y_1 x_1^T w = min_i y_i x_i^T w.
I So x_1 is closest to the decision boundary among all training examples.
I Rescale w so that y_1 x_1^T w = 1.
I Distance from y_1 x_1 to the decision boundary is 1/‖w‖_2.
I The shortest w satisfying

      y_i x_i^T w ≥ 1, i = 1, . . . , n

  gives the linear separator with the maximum margin on all training examples.
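
A quick sketch of this rescaling argument in Python (NumPy assumed): given any separator w, rescale it so the closest example satisfies y_i x_i^T w = 1; the margin achieved is then 1/‖w‖_2.

    import numpy as np

    def rescale_separator(X, y, w):
        """Rescale a linear separator w so that min_i y_i x_i^T w = 1."""
        scores = y * (X @ w)
        assert scores.min() > 0, "w must separate the training data"
        w_scaled = w / scores.min()              # now min_i y_i x_i^T w_scaled = 1
        margin = 1.0 / np.linalg.norm(w_scaled)  # distance of closest y_i x_i to boundary
        return w_scaled, margin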


Support vector machine

I Weight vector of maximum margin linear separator: defined as solution to optimization problem

      min_{w ∈ R^d} (1/2) ‖w‖_2^2
      subject to y_i x_i^T w ≥ 1, i = 1, . . . , n.

  (The 1/2 prefactor is customary but inconsequential.)
I This is the support vector machine (SVM) optimization problem.
I Feasible when data are linearly separable.
I Note: Preference for the weight vector achieving the maximum margin is another example of inductive bias.
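
Because this is a small convex (quadratic) program, a generic solver can handle it directly. A sketch assuming the cvxpy package is available; this is one convenient way to solve the problem, not how SVM solvers are implemented in practice.

    import cvxpy as cp
    import numpy as np

    def hard_margin_svm(X, y):
        """Solve min (1/2)||w||^2 subject to y_i x_i^T w >= 1 (no bias term)."""
        n, d = X.shape
        w = cp.Variable(d)
        constraints = [cp.multiply(y, X @ w) >= 1]
        problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        problem.solve()          # infeasible if the data are not linearly separable
        # Dual values of the margin constraints correspond (up to solver
        # conventions) to the alpha_i in w = sum_i alpha_i y_i x_i.
        return w.value, constraints[0].dual_value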


Support vectors

I Just like least norm solution to normal equations (and ridge regression), solution w to SVM problem can be written as

      w = ∑_{i=1}^n α_i y_i x_i

  for some α_1, . . . , α_n ∈ R (in fact, α_i ≥ 0).
I (Adding r ∈ R^d orthogonal to span of x_i's to weight vector can only increase the length without changing the constraint values.)
I The examples (x_i, y_i) for which α_i ≠ 0 are called support vector examples: they have y_i x_i^T w = 1 and are closest to decision boundary.

Figure 4: Support vectors
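
Continuing the hypothetical hard_margin_svm sketch above (NumPy and cvxpy assumed), the support vector examples can be read off as the ones whose multipliers α_i are nonzero:

    import numpy as np

    w, alpha = hard_margin_svm(X, y)           # X: (n, d) array, y: labels in {-1, +1}
    support = np.where(alpha > 1e-8)[0]        # indices with alpha_i != 0
    print("support vector examples:", support)
    print("y_i x_i^T w at support vectors:", y[support] * (X[support] @ w))  # ~ 1
    # Sanity check: w should (approximately) equal sum_i alpha_i y_i x_i
    print("reconstruction error:", np.linalg.norm(w - (alpha * y) @ X))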

Soft-margin SVM (1)

I What if not linearly separable? SVM problem has no solution.
I Introduce slack variables for constraints, and C ≥ 0:

      min_{w ∈ R^d, ξ_1, . . . , ξ_n ≥ 0} (1/2) ‖w‖_2^2 + C ∑_{i=1}^n ξ_i
      subject to y_i x_i^T w ≥ 1 − ξ_i, i = 1, . . . , n.

I This is the soft-margin SVM optimization problem.
I A constrained convex optimization problem
I For given w, ξ_i/‖w‖_2 is the distance that x_i has to move to satisfy y_i x_i^T w ≥ 1.

Figure 5: Support vectors in soft-margin SVM
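
The constrained form above can be solved the same way as the hard-margin sketch; here is a hypothetical version with explicit slack variables, again assuming cvxpy is available, with C as the trade-off parameter from the slide.

    import cvxpy as cp

    def soft_margin_svm(X, y, C=1.0):
        """min (1/2)||w||^2 + C * sum(xi) s.t. y_i x_i^T w >= 1 - xi_i, xi_i >= 0."""
        n, d = X.shape
        w = cp.Variable(d)
        xi = cp.Variable(n, nonneg=True)       # slack variables, one per example
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
        constraints = [cp.multiply(y, X @ w) >= 1 - xi]
        cp.Problem(objective, constraints).solve()
        return w.value, xi.value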

Soft-margin SVM (2)

I Equivalent unconstrained form:

      min_{w ∈ R^d} (1/2) ‖w‖_2^2 + C ∑_{i=1}^n max{0, 1 − y_i x_i^T w}.

I Rewriting using λ = 1/(nC) and ℓ_hinge (see the code sketch after this list):

      min_{w ∈ R^d} (1/n) ∑_{i=1}^n ℓ_hinge(y_i x_i^T w) + (λ/2) ‖w‖_2^2.

I Same template as ridge regression, Lasso, . . . !
I Data fitting term (using a surrogate loss function)
I Regularizer that promotes inductive bias
I λ controls trade-off of concerns

I Both SVM and soft-margin SVM can be kernelized
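
One way to see the "same template" point in code: scikit-learn's SGDClassifier with hinge loss and an L2 penalty minimizes essentially the λ-form objective above (a sketch assuming scikit-learn is available; alpha plays the role of λ, and fit_intercept=False matches the homogeneous, no-bias formulation used on these slides).

    from sklearn.linear_model import SGDClassifier

    # Regularized ERM: (1/n) sum_i hinge(y_i x_i^T w) + (alpha/2) ||w||_2^2
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=0.01,
                        fit_intercept=False, max_iter=1000, tol=1e-4)
    clf.fit(X, y)                # X: (n, d) array, y: labels in {-1, +1}
    w_hat = clf.coef_.ravel()    # learned weight vector

For the kernelized versions mentioned in the last bullet, a kernel SVM solver such as sklearn.svm.SVC(kernel=...) solves the soft-margin problem without forming feature vectors explicitly.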
