
Machine learning lecture slides
COMS 4771 Fall 2020

Classification II: Margins and SVMs

Outline
- Perceptron
- Margins
- Support vector machines
- Soft-margin SVM

Perceptron (1)
- Perceptron: a variant of SGD
- Uses the hinge loss: $\ell_{\text{hinge}}(s) := \max\{0, 1 - s\}$
- Uses conservative updates: only update when there is a classification mistake
- Step size $\eta = 1$
- Continues updating until all training examples are correctly classified by the current linear classifier
Figure 1: Comparing hinge loss and zero-one loss

Perceptron (2)
- Start with $w^{(0)} = 0$.
- For $t = 1, 2, \dots$ until all training examples are correctly classified by the current linear classifier:
  - Pick a training example, call it $(x_t, y_t)$, misclassified by $w^{(t-1)}$.
  - Update:
    $$w^{(t)} := w^{(t-1)} - \nabla \ell_{\text{hinge}}(y_t x_t^\top w^{(t-1)}).$$

Perceptron (3)
- Note that whenever $y_t x_t^\top w^{(t-1)} \leq 0$,
  $$\nabla \ell_{\text{hinge}}(y_t x_t^\top w^{(t-1)}) = \ell_{\text{hinge}}'(y_t x_t^\top w^{(t-1)}) \cdot y_t x_t = -1 \cdot y_t x_t.$$
- So the update is
  $$w^{(t)} := w^{(t-1)} + y_t x_t.$$
- Final solution is of the form
  $$\hat{w} = \sum_{i \in S} y_i x_i$$
  for some multiset $S$ of $\{1, \dots, n\}$.
  - Possible to include the same example index multiple times in $S$
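A minimal NumPy sketch of the Perceptron described above (function and variable names such as `perceptron`, `X`, `y`, and `max_epochs` are my own choices, not fixed by the slides): start from the zero vector, apply the mistake-driven update $w^{(t)} := w^{(t-1)} + y_t x_t$, and stop once a full pass over the data makes no mistakes.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron with step size 1 and conservative (mistake-driven) updates.

    X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    Returns a weight vector of the form sum_{i in S} y_i x_i.
    """
    n, d = X.shape
    w = np.zeros(d)                      # w^(0) = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * X[i].dot(w) <= 0:  # misclassified by the current w
                w = w + y[i] * X[i]      # w^(t) = w^(t-1) + y_t x_t
                mistakes += 1
        if mistakes == 0:                # all examples correctly classified
            break
    return w
```

If the data are linearly separable, the theorem below guarantees this loop stops after at most $(L/\gamma_\star)^2$ updates.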

Properties of Perceptron
- Suppose $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$ is linearly separable.
􏰛 Does Perceptron find a linear separator? (Yes.) How quickly?
- Depends on the margin achievable on the data set: how much wiggle room there is for linear separators.
Figure 2: Linearly separable data

Margins (1)
- Margin achieved by $w$ on the $i$-th training example is the distance from $y_i x_i$ to the decision boundary:
  $$\gamma_i(w) := \frac{y_i x_i^\top w}{\|w\|_2}.$$
- Maximum margin achievable on all training examples:
  $$\gamma_\star := \max_{w \in \mathbb{R}^d} \min_i \gamma_i(w).$$
- Theorem: If the training data is linearly separable, Perceptron finds a linear separator after making at most $(L/\gamma_\star)^2$ updates, where $L = \max_i \|x_i\|_2$.
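As a small illustration (the helper names below are my own), the per-example margins $\gamma_i(w)$ and the margin a given $w$ achieves on the whole training set can be computed directly; the maximum margin $\gamma_\star$ itself requires solving the SVM optimization problem introduced later.

```python
import numpy as np

def margins(w, X, y):
    """Per-example margins gamma_i(w) = y_i x_i^T w / ||w||_2."""
    return y * (X @ w) / np.linalg.norm(w)

def min_margin(w, X, y):
    """Margin achieved by w on the training set: min_i gamma_i(w)."""
    return margins(w, X, y).min()
```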

Figure 3: Margins

Margins (2)
- Let $w$ be a linear separator:
  $$y_i x_i^\top w > 0, \quad i = 1, \dots, n.$$
- Note: Scaling of $w$ does not change the margin achieved on the $i$-th example:
  $$\gamma_i(w) = \frac{y_i x_i^\top w}{\|w\|_2}.$$
- WLOG assume $y_1 x_1^\top w = \min_i y_i x_i^\top w$.
- So $x_1$ is closest to the decision boundary among all training examples.
- Rescale $w$ so that $y_1 x_1^\top w = 1$.
- Distance from $y_1 x_1$ to the decision boundary is $1/\|w\|_2$.
- The shortest $w$ satisfying
  $$y_i x_i^\top w \geq 1, \quad i = 1, \dots, n$$
  gives the linear separator with the maximum margin on all training examples.

Support vector machine
- Weight vector of the maximum margin linear separator: defined as the solution to the optimization problem
  $$\min_{w \in \mathbb{R}^d} \ \frac{1}{2} \|w\|_2^2 \quad \text{subject to} \quad y_i x_i^\top w \geq 1, \ i = 1, \dots, n.$$
  (The 1/2 prefactor is customary but inconsequential.)
- This is the support vector machine (SVM) optimization problem.
- Feasible when the data are linearly separable.
- Note: Preference for the weight vector achieving the maximum margin is another example of inductive bias.
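Since this is a convex quadratic program, one way to make it concrete is to hand it to an off-the-shelf solver. The sketch below uses CVXPY, which is my own choice of tool rather than anything prescribed by the slides; the problem is only feasible (and a weight vector is only returned) when the data are linearly separable.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||_2^2 subject to y_i x_i^T w >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    constraints = [cp.multiply(y, X @ w) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    if w.value is None:
        raise ValueError("infeasible: data are not linearly separable")
    return w.value
```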

Support vectors
- Just like the least norm solution to the normal equations (and ridge regression), the solution $w$ to the SVM problem can be written as $\sum_{i=1}^n \alpha_i y_i x_i$ for some $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ (in fact, $\alpha_i \geq 0$).
  - (Adding $r \in \mathbb{R}^d$ orthogonal to the span of the $x_i$'s to the weight vector can only increase its length without changing the constraint values.)
- The examples $(x_i, y_i)$ for which $\alpha_i \neq 0$ are called support vector examples: they have $y_i x_i^\top w = 1$ and are closest to the decision boundary.
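To see support vectors numerically, one option (my own choice, not something from the slides) is scikit-learn's linear SVC with a large C to approximate the hard-margin problem; note that, unlike the formulation above, SVC also fits an intercept term. After fitting, `support_` holds the indices $i$ with $\alpha_i \neq 0$ and `dual_coef_` holds the corresponding values $\alpha_i y_i$.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (purely illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin

print(clf.support_)      # indices of the support vector examples (alpha_i != 0)
print(clf.dual_coef_)    # alpha_i * y_i for those examples
print(clf.coef_)         # weight vector, a combination of the support vectors
```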

Soft-margin SVM (1)
- What if the data are not linearly separable? The SVM problem has no solution.
- Introduce slack variables for the constraints, and $C \geq 0$:
  $$\min_{w \in \mathbb{R}^d, \ \xi_1, \dots, \xi_n \geq 0} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i x_i^\top w \geq 1 - \xi_i, \ i = 1, \dots, n.$$
- This is the soft-margin SVM optimization problem.
  - A constrained convex optimization problem
- For given $w$, $\xi_i / \|w\|_2$ is the distance that $x_i$ has to move to satisfy $y_i x_i^\top w \geq 1$.

Soft-margin SVM (2)
- Equivalent unconstrained form:
  $$\min_{w \in \mathbb{R}^d} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \max\{0, 1 - y_i x_i^\top w\}.$$
- Rewriting using $\lambda = 1/(nC)$ and $\ell_{\text{hinge}}$:
  $$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell_{\text{hinge}}(y_i x_i^\top w) + \frac{\lambda}{2} \|w\|_2^2.$$
- Same template as ridge regression, Lasso, ...!
  - Data fitting term (using a surrogate loss function)
  - Regularizer that promotes inductive bias
  - $\lambda$ controls the trade-off of concerns
- Both SVM and soft-margin SVM can be kernelized
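Because the unconstrained form is just a regularized average hinge loss, it can be minimized by (sub)gradient descent, in the same spirit as the SGD view of Perceptron. Below is a minimal NumPy sketch under my own choices of step size, iteration count, and function name (none of these come from the slides).

```python
import numpy as np

def soft_margin_svm(X, y, lam=0.1, step=0.01, iters=1000):
    """Subgradient descent on (1/n) sum_i hinge(y_i x_i^T w) + (lam/2) ||w||_2^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        active = y * (X @ w) < 1                     # examples with nonzero hinge loss
        # Subgradient: -(1/n) * sum over active examples of y_i x_i, plus lam * w.
        grad = -(X[active].T @ y[active]) / n + lam * w
        w = w - step * grad
    return w
```

Larger $\lambda$ (i.e., smaller $C$) puts more weight on the regularizer, matching the trade-off noted above.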