
Feb 15, 2022 1 / 66

Today’s lecture:


Support Vector Machines: one of the most widely used classification models
We will focus on linear SVMs today (non-linear SVMs next week!)
Derive the SVM learning objective (in two ways)
Solve the optimization problem
Get insight from its dual problem
(Requires some background knowledge on convex optimization)

Part I: Derive the SVM Objective
Start with the inductive bias: what makes a good linear decision boundary?
Start with the loss function and regularization

Maximum Margin Classifier

Linearly Separable Data
Consider a linearly separable dataset D.
Find a separating hyperplane such that
wTxi > 0 for all xi where yi = +1
wTxi < 0 for all xi where yi = −1

The Perceptron Algorithm
Initialize w ← 0
While not converged (there exist misclassified examples)
  For (xi, yi) ∈ D
    If yi wTxi < 0 (wrong prediction)
      Update w ← w + yi xi
Intuition: move towards misclassified positive examples and away from misclassified negative examples
Guaranteed to find a zero-error classifier (if one exists) in finitely many steps
What is the loss function if we view this as an SGD algorithm?

Maximum-Margin Separating Hyperplane
For separable data, there are infinitely many zero-error classifiers. Which one do we pick? (The perceptron does not return a unique solution.)

Maximum-Margin Separating Hyperplane
We prefer the classifier that is farthest from both classes of points
Geometric margin: the smallest distance between the hyperplane and the points
Maximum margin: the largest distance to the closest points

Geometric Margin
We want to maximize the distance between the separating hyperplane and the closest points. Let's formalize the problem.
Definition (separating hyperplane)
We say (xi, yi) for i = 1, . . . , n are linearly separable if there are w ∈ Rd and b ∈ R such that yi(wTxi + b) > 0 for all i. The set {v ∈ Rd | wTv + b = 0} is called a separating hyperplane.
Definition (geometric margin)
Let H be a hyperplane that separates the data (xi , yi ) for i = 1, . . . , n. The geometric margin of this hyperplane is
min_i d(xi, H),
the distance from the hyperplane to the closest data point.

Distance between a Point and a Hyperplane
Projection of v ∈ Rd onto w ∈ Rd: v·w / ∥w∥
Distance between xi and H:
d(xi, H) = |wTxi + b| / ∥w∥ = yi(wTxi + b) / ∥w∥
(the second equality holds when xi is correctly classified)
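As a quick numeric check, the point-to-hyperplane distance above can be computed directly (a minimal sketch; the values of `w`, `b`, and the sample point are made up for illustration):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane {v : w^T v + b = 0}."""
    return abs(w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
x = np.array([2.0, 1.0])   # w^T x + b = 6 + 4 - 5 = 5
print(distance_to_hyperplane(x, w, b))  # -> 1.0
```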

Maximize the Margin
We want to maximize the geometric margin:
maximize min_i d(xi, H).
Given the separating hyperplane H = {v | wTv + b = 0}, we have
maximize min_i yi(wTxi + b) / ∥w∥.
Let's remove the inner minimization problem by rewriting this as
maximize M
subject to yi(wTxi + b) / ∥w∥ ≥ M for all i
Note that the solution is not unique (why?).

Maximize the Margin
Let's fix the norm to ∥w∥ = 1/M to obtain:
maximize 1/∥w∥
subject to yi(wTxi + b) ≥ 1 for all i
Note that yi(wTxi + b) is the (functional) margin.
This is equivalent to solving the minimization problem
minimize (1/2)∥w∥²
subject to yi(wTxi + b) ≥ 1 for all i
In words, it finds the minimum-norm solution that has a margin of at least 1 on all examples.

Soft Margin SVM
What if the data is not linearly separable?
For any w, there will be points with a negative margin. Introduce slack variables to penalize small margins:
minimize (1/2)∥w∥² + (C/n) Σ_{i=1}^n ξi
subject to yi(wTxi + b) ≥ 1 − ξi for all i
           ξi ≥ 0 for all i
If ξi = 0 for all i, it reduces to the hard-margin SVM.
What does ξi > 0 mean?
What does C control?

d(xi, H) = yi(wTxi + b)/∥w∥ ≥ (1 − ξi)/∥w∥, thus ξi measures the violation in multiples of the geometric margin:
ξi = 1: xi lies on the decision hyperplane
ξi = 3: xi is 2 margin widths past the decision hyperplane

Minimize the Hinge Loss

Perceptron Loss
Perceptron loss: l(x, y, w) = max(0, −y wTx)
If we do ERM with this loss function, what happens?
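SGD on this loss (with unit step size) recovers the perceptron update from earlier. A minimal sketch, with a made-up toy dataset and epoch cap:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: SGD with unit step size on l(x, y, w) = max(0, -y w^T x)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # subgradient step on max(0, -y w^T x)
                mistakes += 1
        if mistakes == 0:            # converged: zero training error
            break
    return w

# Linearly separable toy data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # all training points classified correctly
```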

Hinge Loss
SVM/Hinge loss: lHinge(m) = max{1 − m, 0} = (1 − m)+
Margin m = yf(x); "positive part" (x)+ = x·1[x ≥ 0].
Hinge is a convex upper bound on the 0−1 loss. Not differentiable at m = 1.
We have a "margin error" when m < 1.

Support Vector Machine
Using ERM:
Hypothesis space F = {f(x) = wTx + b | w ∈ Rd, b ∈ R}
l2 regularization (Tikhonov style)
Hinge loss l(m) = max{1 − m, 0} = (1 − m)+
The SVM prediction function is the solution to
min_{w ∈ Rd, b ∈ R} (1/2)||w||² + (c/n) Σ_{i=1}^n max(0, 1 − yi(wTxi + b)).
Not differentiable because of the max.

SVM as a Constrained Optimization Problem
The SVM optimization problem is equivalent to
minimize (1/2)||w||² + (c/n) Σ_{i=1}^n ξi
subject to ξi ≥ max(0, 1 − yi(wTxi + b)) for i = 1, . . . , n,
which is equivalent to
minimize (1/2)||w||² + (c/n) Σ_{i=1}^n ξi
subject to ξi ≥ 1 − yi(wTxi + b) for i = 1, . . . , n
           ξi ≥ 0 for i = 1, . . . , n

Two ways to derive the SVM optimization problem:
Maximize the (geometric) margin
Minimize the hinge loss with l2 regularization
Both lead to the minimum-norm solution satisfying certain margin constraints.
Hard-margin SVM: all points must be correctly classified, subject to the margin constraints
Soft-margin SVM: allow margin-constraint violations with some penalty

Part II: Subgradient Descent for SVM
Now that we have the objective, can we do SGD on it?
Subgradient: generalizes the gradient to non-differentiable convex functions

SVM Optimization Problem (no intercept)
SVM objective function:
J(w) = (1/n) Σ_{i=1}^n max(0, 1 − yi wTxi) + λ||w||².
Not differentiable... but let's think about gradient descent anyway.
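The hinge loss and this objective are easy to sketch in code (a minimal illustration; the function names and the test margins are my own):

```python
import numpy as np

def hinge(m):
    """Hinge loss l(m) = max(0, 1 - m) on margins m = y * f(x)."""
    return np.maximum(0.0, 1.0 - m)

def svm_objective(w, X, y, lam):
    """J(w) = (1/n) sum_i max(0, 1 - y_i w^T x_i) + lam * ||w||^2 (no intercept)."""
    margins = y * (X @ w)
    return hinge(margins).mean() + lam * (w @ w)

# margins 2, 1, 0.5, -1 give losses 0, 0, 0.5, 2
print(hinge(np.array([2.0, 1.0, 0.5, -1.0])))
```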
Hinge is a convex, upper bound on 0 − 1 loss. Not differentiable at m = 1. We have a “margin error” when m < 1. Feb 15, 2022 Support Vector Machine Using ERM: Hypothesis space F=􏳓f(x)=wTx+b|w ∈Rd,b∈R􏳔. l2 regularization (Tikhonov style) Hinge loss l(m) = max{1−m,0} = (1−m)+ The SVM prediction function is the solution to w∈Rd,b∈R 2 n Not differentiable because of the max min ||w||2 + max􏳑0,1−yi 􏳂wT xi +b􏳃􏳒. Feb 15, 2022 SVM as a Constrained Optimization Problem The SVM optimization problem is equivalent to Which is equivalent to minimize subject to minimize subject to ξi 􏳬 􏳑1−yi 􏳂wT xi +b􏳃􏳒 for i = 1,...,n ξi 􏳬0for i =1,...,n ξi 􏳬 max􏳑0,1−yi 􏳂wT xi +b􏳃􏳒. Feb 15, 2022 Two ways to derive the SVM optimization problem: Maximize the (geometric) margin Minimize the hinge loss with l2 regularization Both leads to the minimum norm solution satisfying certain margin constraints. Hard-margin SVM: all points must be correctly classified with the margin constraints Soft-margin SVM: allow for margin constraint violation with some penalty Feb 15, 2022 Part II: Subgradient Descent for SVM Now that we have the objective, can we do SGD on it? Subgradient: generalize gradient for non-differentiable convex functions Feb 15, 2022 21 / 66 SVM Optimization Problem (no intercept) SVM objective function: max􏳑0,1−yiwTxi􏳒+λ||w||2. Not differentiable... but let’s think about gradient descent anyway. 
Hinge loss: l(m) = max(0, 1 − m)
J(w) = (1/n) Σ_{i=1}^n l(yi wTxi) + λ||w||²
∇w J(w) = ∇w [(1/n) Σ_{i=1}^n l(yi wTxi) + λ||w||²] = (1/n) Σ_{i=1}^n ∇w l(yi wTxi) + 2λw

"Gradient" of SVM Objective
Derivative of the hinge loss l(m) = max(0, 1 − m):
l′(m) = 0 if m > 1; −1 if m < 1; undefined if m = 1
By the chain rule, we have
∇w l(yi wTxi) = l′(yi wTxi) yi xi
  = 0 if yi wTxi > 1
  = −yi xi if yi wTxi < 1
  = undefined if yi wTxi = 1

"Gradient" of SVM Objective
∇w J(w) = (1/n) Σ_{i=1}^n ∇w l(yi wTxi) + 2λw
  = (1/n) Σ_{i: yi wTxi < 1} (−yi xi) + 2λw if yi wTxi ≠ 1 for all i
  = undefined otherwise

Gradient Descent on SVM Objective?
The gradient of the SVM objective is (1/n) Σ_{i: yi wTxi < 1} (−yi xi) + 2λw when yi wTxi ≠ 1 for all i, and otherwise is undefined.
Potential arguments for why we shouldn't care about the points of nondifferentiability:
If we start with a random w, will we ever hit exactly yi wTxi = 1?
If we did, could we perturb the step size by ε to miss such a point?
Does it even make sense to check yi wTxi = 1 with floating-point numbers?
However, would gradient descent work if the objective is not differentiable?

Subgradient

First-Order Condition for a Convex, Differentiable Function
Suppose f : Rd → R is convex and differentiable. Then for any x, y ∈ Rd,
f(y) ≥ f(x) + ∇f(x)T(y − x)
The linear approximation to f at x is a global underestimator of f.
This implies that if ∇f(x) = 0, then x is a global minimizer of f.
Figure from Boyd & Vandenberghe Fig. 3.2; proof in Section 3.1.3.

Subgradients
Definition
A vector g ∈ Rd is a subgradient of a convex function f : Rd → R at x if for all z,
f(z) ≥ f(x) + gT(z − x).
Blue is a graph of f(x). Each red line x ↦ f(x0) + gT(x − x0) is a global lower bound on f(x).

Properties
Definitions
The set of all subgradients at x is called the subdifferential: ∂f(x)
f is subdifferentiable at x if there exists at least one subgradient at x.
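The subgradient definition can be checked numerically. For f(x) = |x|, any g ∈ [−1, 1] satisfies the subgradient inequality at x = 0, and nothing outside that interval does (a small illustrative check on a finite grid of my choosing):

```python
import numpy as np

def is_subgradient(f, g, x0, zs):
    """Check f(z) >= f(x0) + g * (z - x0) on a grid of test points zs."""
    return all(f(z) >= f(x0) + g * (z - x0) - 1e-12 for z in zs)

f = abs
zs = np.linspace(-5, 5, 101)
# Every g in [-1, 1] is a subgradient of |x| at 0 ...
print(all(is_subgradient(f, g, 0.0, zs) for g in (-1.0, -0.5, 0.0, 0.5, 1.0)))  # True
# ... but values outside [-1, 1] are not:
print(is_subgradient(f, 1.5, 0.0, zs))  # False
```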
For convex functions:
f is differentiable at x iff ∂f(x) = {∇f(x)}.
The subdifferential is always non-empty (∂f(x) = ∅ ⟹ f is not convex).
x is the global optimum iff 0 ∈ ∂f(x).
For non-convex functions:
The subdifferential may be an empty set (no global underestimator).

Subdifferential of Absolute Value
Consider f(x) = |x|.
The plot on the right shows {(x, g) | x ∈ R, g ∈ ∂f(x)}.
From the EE364b "Subgradients" slides.

Subgradients of f(x1, x2) = |x1| + 2|x2|
Let's find the subdifferential of f(x1, x2) = |x1| + 2|x2| at (3, 0).
The first coordinate of the subgradient must be 1, from the |x1| part (at x1 = 3).
The second coordinate of the subgradient can be anything in [−2, 2].
So the graph of h(x1, x2) = f(3, 0) + gT(x1 − 3, x2 − 0) is a global underestimate of f(x1, x2) for any g = (g1, g2) with g1 = 1 and g2 ∈ [−2, 2].

Subdifferential on Contour Plot
Contour plot of f(x1, x2) = |x1| + 2|x2|, with the set of subgradients at (3, 0).
Plot courtesy of .

Basic Rules for Calculating the Subdifferential
Non-negative scaling: ∂(αf)(x) = α∂f(x) for α > 0
Summation: d1 + d2 ∈ ∂(f1 + f2)(x) for any d1 ∈ ∂f1(x) and d2 ∈ ∂f2(x)
Composing with affine functions: ∂f(Ax + b) = AT ∂f(z), where z = Ax + b
max: convex combinations of the gradients of the argmax:
∂ max(f1(x), f2(x)) =
  {∇f1(x)} if f1(x) > f2(x),
  {∇f2(x)} if f1(x) < f2(x),
  {∇(θf1(x) + (1 − θ)f2(x)) | θ ∈ [0, 1]} if f1(x) = f2(x).
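Applying the max rule to the hinge loss l(m) = max(0, 1 − m) gives one valid subgradient everywhere (a sketch; at the kink m = 1 any value in [−1, 0] is valid, and 0 is an arbitrary choice):

```python
def hinge_subgradient(m):
    """A subgradient of l(m) = max(0, 1 - m), via the max rule.

    m > 1: the active branch is the constant 0, so l'(m) = 0
    m < 1: the active branch is 1 - m, so l'(m) = -1
    m = 1: any theta * (-1) + (1 - theta) * 0 with theta in [0, 1] works
    """
    if m > 1:
        return 0.0
    if m < 1:
        return -1.0
    return 0.0  # one valid choice from the interval [-1, 0]

print([hinge_subgradient(m) for m in (2.0, 1.0, 0.0)])  # [0.0, 0.0, -1.0]
```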
So f(y) > f(x0).
So y is not in the level set S.
∴ All elements of S must be on H or on the −g side of H.

Subgradient of f (x1, x2) = |x1| + 2 |x2|
Points on g side of H have larger f -values than f (x0). (from proof)
But points on −g side may not have smaller f -values.
So −g may not be a descent direction. (shown in figure) Plot courtesy of .

Subgradient Descent
Move along the negative subgradient:
xt+1 = xt − ηg, where g ∈ ∂f(xt) and η > 0
This can increase the objective but gets us closer to the minimizer if f is convex and η is small enough:
∥xt+1 − x*∥ < ∥xt − x*∥
Subgradients don't necessarily converge to zero as we get closer to x*, so we need decreasing step sizes, e.g. O(1/t) or O(1/√t).
Subgradient methods are slower than gradient descent, e.g. O(1/ε²) vs. O(1/ε) for convex functions.
Based on https://www.cs.ubc.ca/~schmidtm/Courses/5XX-S20/S4.pdf

Subgradient Descent for SVM (HW3)
SVM objective function:
J(w) = (1/n) Σ_{i=1}^n max(0, 1 − yi wTxi) + λ||w||².
Pegasos: stochastic subgradient descent with step size ηt = 1/(tλ)

Subgradient: generalizes the gradient to non-differentiable convex functions
Subgradient "descent":
General method for non-smooth functions
Simple to implement
Slow to converge

Part III: The Dual Problem
In addition to subgradient descent, we can directly solve the optimization problem using a QP solver.
Let's study its dual problem to gain additional insight (which will be useful for next week!)

SVM as a Quadratic Program
The SVM optimization problem is equivalent to
minimize (1/2)||w||² + (c/n) Σ_{i=1}^n ξi
subject to −ξi ≤ 0 for i = 1, . . . , n
           (1 − yi(wTxi + b)) − ξi ≤ 0 for i = 1, . . . , n
Differentiable objective function
n + d + 1 unknowns and 2n affine constraints
A quadratic program that can be solved by any off-the-shelf QP solver.
Let's learn more by examining the dual.

Why Do We Care About the Dual?

The Lagrangian
The general [inequality-constrained] optimization problem is:
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Definition
The Lagrangian for this optimization problem is
L(x, λ) = f0(x) + Σ_{i=1}^m λi fi(x)
The λi's are called Lagrange multipliers (also called the dual variables).
Weighted sum of the objective and constraint functions
Hard constraints → soft constraints

The Lagrange Dual Function
Definition
The Lagrange dual function is
g(λ) = inf_x L(x, λ) = inf_x ( f0(x) + Σ_{i=1}^m λi fi(x) )
g(λ) is concave (why?)
Lower bound property: if λ ≽ 0, then g(λ) ≤ p*, where p* is the optimal value of the optimization problem.
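The lower-bound property can be checked on a one-dimensional example (an illustrative problem of my choosing, not from the slides): minimize x² subject to 1 − x ≤ 0. Here L(x, λ) = x² + λ(1 − x), and the inf over x is attained at x = λ/2, giving a closed-form dual function:

```python
import numpy as np

def g(lam):
    """Lagrange dual of: minimize x^2 subject to 1 - x <= 0.
    L(x, lam) = x^2 + lam * (1 - x); the inf over x is at x = lam / 2."""
    return lam - lam**2 / 4.0

p_star = 1.0  # primal optimum: x* = 1, f0(x*) = 1
lams = np.linspace(0.0, 4.0, 41)
print(all(g(l) <= p_star + 1e-12 for l in lams))  # True: g(lam) <= p* everywhere
print(g(2.0))  # 1.0 -> the bound is tight at lam* = 2 (strong duality)
```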
g(λ) can be −∞ (an uninformative lower bound)

The Primal and the Dual
For any primal-form optimization problem,
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m,
there is a recipe for constructing a corresponding Lagrangian dual problem:
maximize g(λ)
subject to λi ≥ 0, i = 1, . . . , m.
The dual problem is always a convex optimization problem.
The dual variables often have interesting and relevant interpretations.
The dual variables provide a certificate of optimality.

Weak Duality
We always have weak duality: p* ≥ d*.
Plot courtesy of .

Strong Duality
For some problems, we have strong duality: p* = d*.
For convex problems, strong duality is fairly typical.
Plot courtesy of .

Complementary Slackness
Assume strong duality. Let x* be primal optimal and λ* be dual optimal. Then:
f0(x*) = g(λ*) = inf_x L(x, λ*)   (strong duality and definition)
       ≤ L(x*, λ*)
       = f0(x*) + Σ_{i=1}^m λi* fi(x*)
       ≤ f0(x*).
Each term λi* fi(x*) in the sum must actually be 0. This condition is known as complementary slackness:
λi > 0 ⟹ fi(x*) = 0 and fi(x*) < 0 ⟹ λi = 0, for all i

The SVM Dual Problem
minimize (1/2)||w||² + (c/n) Σ_{i=1}^n ξi
subject to (1 − yi(wTxi + b)) − ξi ≤ 0 for i = 1, . . . , n
           −ξi ≤ 0 for i = 1, . . . , n
Multiplier αi for the constraint (1 − yi(wTxi + b)) − ξi ≤ 0, and λi for −ξi ≤ 0:
L(w, b, ξ, α, λ) = (1/2)||w||² + (c/n) Σ_{i=1}^n ξi + Σ_{i=1}^n αi((1 − yi(wTxi + b)) − ξi) + Σ_{i=1}^n λi(−ξi)
Dual optimal value: d* = sup_{α,λ ≽ 0} inf_{w,b,ξ} L(w, b, ξ, α, λ)

Strong Duality by Slater's Constraint Qualification
The SVM optimization problem:
minimize (1/2)||w||² + (c/n) Σ_{i=1}^n ξi
subject to −ξi ≤ 0 for i = 1, . . . , n
           (1 − yi(wTxi + b)) − ξi ≤ 0 for i = 1, . . . , n
Slater's constraint qualification: for a convex problem with affine constraints, strong duality holds iff the problem is feasible.
Do we have a feasible point?
For the SVM, we have strong duality.
SVM Dual Function: First-Order Conditions
The Lagrange dual function is the inf over the primal variables of L:
g(α, λ) = inf_{w,b,ξ} L(w, b, ξ, α, λ)
        = inf_{w,b,ξ} [ (1/2)wTw + Σ_{i=1}^n ξi (c/n − αi − λi) + Σ_{i=1}^n αi(1 − yi(wTxi + b)) ]
∂w L = 0 ⟺ w − Σ_{i=1}^n αi yi xi = 0 ⟺ w = Σ_{i=1}^n αi yi xi
∂b L = 0 ⟺ −Σ_{i=1}^n αi yi = 0 ⟺ Σ_{i=1}^n αi yi = 0
∂ξi L = 0 ⟺ c/n − αi − λi = 0 ⟺ αi + λi = c/n

SVM Dual Function
Substituting these conditions back into L, the second term disappears. The first and third terms become
(1/2)wTw = (1/2) Σ_{i,j=1}^n αi αj yi yj xiTxj
Σ_{i=1}^n αi(1 − yi(wTxi + b)) = Σ_{i=1}^n αi − Σ_{i,j=1}^n αi αj yi yj xjTxi − b Σ_{i=1}^n αi yi.
Putting it together, the dual function is
g(α, λ) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xjTxi, if Σ_{i=1}^n αi yi = 0 and αi + λi = c/n for all i;
        = −∞ otherwise.

SVM Dual Problem
The dual problem is sup_{α,λ ≽ 0} g(α, λ):
sup_α Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xjTxi
subject to Σ_{i=1}^n αi yi = 0
           αi + λi = c/n, αi, λi ≥ 0, i = 1, . . . , n

Insights from the Dual Problem

KKT Conditions
For convex problems, if Slater's condition is satisfied, then the KKT conditions provide necessary and sufficient conditions for the optimal solution:
Primal feasibility: fi(x) ≤ 0 for all i
Dual feasibility: λ ≽ 0
Complementary slackness: λi fi(x) = 0
First-order condition: ∂L(x, λ)/∂x = 0

The SVM Dual Solution
We found the SVM dual problem can be written as:
sup_α Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xjTxi
subject to Σ_{i=1}^n αi yi = 0
           αi ∈ [0, c/n], i = 1, . . . , n.
Given a solution α* to the dual, the primal solution is w* = Σ_{i=1}^n αi* yi xi.
The solution is in the space spanned by the inputs.
Note αi* ∈ [0, c/n], so c controls the maximum weight on each example. (Robustness!)
What's the relation between c and regularization?

Complementary Slackness Conditions
Recall our primal constraints and Lagrange multipliers:
constraint (1 − yi f(xi)) − ξi ≤ 0 with multiplier αi, and constraint −ξi ≤ 0 with multiplier λi.
Recall the first-order condition ∇ξi L = 0 gave us λi* = c/n − αi*.
By strong duality, we must have complementary slackness:
αi*(1 − yi f*(xi) − ξi*) = 0
λi* ξi* = (c/n − αi*) ξi* = 0

Consequences of Complementary Slackness
By strong duality, we must have complementary slackness:
αi*(1 − yi f*(xi) − ξi*) = 0
(c/n − αi*) ξi* = 0
Recall that the "slack variable" ξi* = max(0, 1 − yi f*(xi)) is the hinge loss on (xi, yi).
If yi f*(xi) > 1, then the margin loss is ξi* = 0, and we get αi* = 0.
If yi f*(xi) < 1, then the margin loss is ξi* > 0, so αi* = c/n.
If αi* = 0, then ξi* = 0, which implies no loss, so yi f*(xi) ≥ 1.
If αi* ∈ (0, c/n), then ξi* = 0, which implies 1 − yi f*(xi) = 0.

Complementary Slackness Results: Summary
If α* is a solution to the dual problem, then the primal solution is
w* = Σ_{i=1}^n αi* yi xi, where αi* ∈ [0, c/n].
Relation between the margin and the example weights (the αi's):
αi* = 0 ⟹ yi f*(xi) ≥ 1
αi* ∈ (0, c/n) ⟹ yi f*(xi) = 1
αi* = c/n ⟹ yi f*(xi) ≤ 1
yi f*(xi) < 1 ⟹ αi* = c/n
yi f*(xi) = 1 ⟹ αi* ∈ [0, c/n]
yi f*(xi) > 1 ⟹ αi* = 0

Support Vectors
If α* is a solution to the dual problem, then the primal solution is
w* = Σ_{i=1}^n αi* yi xi, with αi* ∈ [0, c/n].
The xi's corresponding to αi* > 0 are called support vectors.
Few margin errors or “on the margin” examples =⇒ sparsity in input examples.

The Bias Term: b
For our SVM primal, the complementary slackness conditions are:
(1) αi*(1 − yi(xiTw* + b) − ξi*) = 0
(2) λi* ξi* = (c/n − αi*) ξi* = 0
Suppose there's an i such that αi* ∈ (0, c/n). Then (2) implies ξi* = 0, and (1) implies
yi(xiTw* + b*) = 1
⟺ xiTw* + b* = yi (using yi ∈ {−1, 1})
⟺ b* = yi − xiTw*

The Bias Term: b
We get the same b* for any choice of i with αi* ∈ (0, c/n):
b* = yi − xiTw*
With numerical error, it is more robust to average over all eligible i's:
b* = mean{ yi − xiTw* | αi* ∈ (0, c/n) }.
What if there are no αi* ∈ (0, c/n)?
Then we have a degenerate SVM training problem¹ (w* = 0).
¹See Rifkin et al.'s "A Note on Support Vector Machine Degeneracy", an MIT AI Lab Technical Report.
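The averaged intercept computation can be sketched as follows (a sketch; the function name, tolerance, and the toy numbers in the example are my own, with `c` the regularization parameter from the primal):

```python
import numpy as np

def svm_intercept(alpha, X, y, w, c, tol=1e-8):
    """b* = mean{ y_i - x_i^T w* : alpha_i strictly inside (0, c/n) }."""
    n = len(y)
    eligible = (alpha > tol) & (alpha < c / n - tol)
    if not eligible.any():
        raise ValueError("degenerate SVM: no alpha_i strictly inside (0, c/n)")
    return float(np.mean(y[eligible] - X[eligible] @ w))

# Toy numbers consistent with w* = (1, 0), b* = -0.5:
# both points sit exactly on the margin, y_i (x_i^T w* + b*) = 1.
X = np.array([[1.5, 0.0], [-0.5, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.2, 0.2])   # both strictly inside (0, c/n) = (0, 0.5)
w = np.array([1.0, 0.0])
print(svm_intercept(alpha, X, y, w, c=1.0))  # -> -0.5
```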

Teaser for Kernelization

Dual Problem: Dependence on x through Inner Products
SVM Dual Problem:
sup_α Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xjTxi
subject to αi ∈ [0, c/n], i = 1, . . . , n.
Note that all dependence on the inputs xi and xj is through their inner product: ⟨xj, xi⟩ = xjTxi. We can replace xjTxi by other products…
This is a “kernelized” objective function.
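Swapping the inner product for a kernel evaluation only changes how the Gram matrix is built (a sketch; the RBF kernel here is one common choice, previewing next week's material):

```python
import numpy as np

def linear_gram(X):
    """Gram matrix K[i, j] = x_i^T x_j, as used by the linear dual objective."""
    return X @ X.T

def rbf_gram(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2): a drop-in replacement for x_i^T x_j."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)  # pairwise squared distances
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
print(linear_gram(X))  # K[1, 1] = 1, all other entries 0
print(rbf_gram(X))     # diagonal is 1; off-diagonal entries are exp(-1)
```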
