Linear Machines & SVM
Linear Machines
Define general linear classifiers
Revisiting Logistic Regression
|In Logistic Regression: given a training set of n labelled samples
➔ We end up with a linear classifier.
➔ g(x) = wtx is called the discriminant function.
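As a minimal sketch (not part of the original slides; data invented for illustration), the weights learned by logistic regression directly give a linear discriminant g(x):
```python
# Sketch: logistic regression yields a linear discriminant g(x) = w^t x + w0.
# The toy data below are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
w, w0 = clf.coef_[0], clf.intercept_[0]

# g(x) > 0 corresponds to predicting class 1; the boundary g(x) = 0 is a line.
g = X @ w + w0
print(g, clf.predict(X))
```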
Linear Discriminant Functions
|In general, taking a discriminative approach, we can assume some form for the discriminant function that defines the classifier.
➔The learning task is to use the training samples to estimate the parameters of the classifier.
Linear Decision Boundaries
|Linear discriminant functions give rise to linear decision boundaries
➔linear classifiers or linear machines
|We will use both notations:
g(x) = wtx or g(x) = wtx + w0
Linear Machine for C>2 Classes
|We can define C linear discriminant functions: gi(x) = witx, i = 1, 2, …, C
|What is the decision rule for the classifier?
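The standard rule is to assign x to the class whose discriminant is largest, i.e., argmax over i of gi(x). A minimal numpy sketch with invented weights:
```python
# Sketch of the C-class decision rule: assign x to argmax_i g_i(x) = w_i^t x.
import numpy as np

W = np.array([[1.0, -0.5],   # w_1
              [0.2,  1.0],   # w_2
              [-1.0, 0.3]])  # w_3  (illustrative values)
x = np.array([0.8, 0.4])

g = W @ x                        # g_i(x) for i = 1..C
predicted_class = np.argmax(g) + 1   # classes numbered from 1 here
print(g, predicted_class)
```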
The Learning Task
|Finding wi, i = 1, 2, …, C
|Let’s use the 2-class case as an example
-For n samples x1, …, xn from two classes ω1 and ω2, if there exists a vector w such that g(x) = wtx classifies them all correctly ➔ finding w
| i.e., finding w such that
wtxi ≥ 0 for samples of ω1 and wtxi < 0 for samples of ω2
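A small numpy sketch of this separability check, with a toy data set and candidate w invented for illustration:
```python
# Sketch: check whether a candidate w separates the two classes, i.e.,
# w^t x_i >= 0 for samples of class omega_1 and w^t x_i < 0 for class omega_2.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy samples
labels = np.array([1, 1, 2, 2])                                     # class indices

def separates(w, X, labels):
    scores = X @ w
    ok1 = np.all(scores[labels == 1] >= 0)
    ok2 = np.all(scores[labels == 2] < 0)
    return ok1 and ok2

print(separates(np.array([1.0, 1.0]), X, labels))  # True for this toy set
```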
Linear Separability
|If we can find at least one vector w such that g(x) = wtx classifies all samples correctly
➔ We say the samples are linearly separable.
|An example of a set that is not linearly separable in 2-D:
The Solution Region
|There may be many different weight vectors that can all be valid solutions for a given training set
➔The solution regions
|If the solution vector is not unique, which one is the best?
Solving for the Weight Vector
|Consider the following approach: finding a solution vector which optimizes some objective function.
➔ We may introduce additional constraints for a “good” solution
➔ Solving a constrained optimization problem.
|Theoretically: via Lagrange multipliers or the Karush-Kuhn-Tucker (KKT) conditions.
|In practice: e.g., gradient-descent-based search
Gradient Descent Procedure
|Basic idea:
- Define a cost function J(w)
- Starting from an initial weight vector w(0)
- Update w iteratively by
w(k+1) = w(k) − η(k) ∇J(w(k)),
|where η(k) > 0 is the learning rate.
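A minimal runnable sketch of this procedure; the cost J and its gradient here are invented purely to make the loop concrete:
```python
# Sketch of the gradient-descent update w(k+1) = w(k) - eta(k) * grad J(w(k)).
# J here is an arbitrary quadratic used only to make the loop runnable.
import numpy as np

def grad_J(w):
    # gradient of J(w) = 0.5 * ||w - [1, 2]||^2 (illustrative cost)
    return w - np.array([1.0, 2.0])

w = np.zeros(2)          # w(0): initial weight vector
eta = 0.1                # learning rate (kept constant here)
for k in range(100):
    w = w - eta * grad_J(w)

print(w)                 # converges towards [1, 2]
```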
Linear Machines & SVM
The Concept of Margins
Illustrate Margins in Classifiers
Illustrating Linear Boundaries
|The decision boundary is given by the line g(x) = 0.
-To appreciate the geometric interpretation, we write w0 explicitly, i.e., we have
g(x) = wtx + w0
|The normal vector of the decision line/plane is w (the weight vector).
Which one is better?
➔Consider the distances of the samples to the decision plane.
Distance to the Decision Plane
g(x) = wtx + w0
|g(x) gives an algebraic measure of the distance from x to the decision plane: the signed distance is r = g(x)/||w||.
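A one-step numpy sketch of this distance computation, with plane parameters invented for illustration:
```python
# Sketch: signed distance from x to the plane g(x) = w^t x + w0 = 0 is g(x)/||w||.
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0      # illustrative plane parameters
x = np.array([2.0, 1.0])

g = w @ x + w0
distance = g / np.linalg.norm(w)
print(distance)                          # here: 5 / 5 = 1.0 (positive side)
```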
The Concept of Margins
|Let g(x) = 0 be a decision plane
-The margin of a sample x (w.r.t. the decision plane) is the distance from x to the plane.
-For a given set of samples S, the margin (w.r.t. a decision plane) is the smallest margin over all x in S.
|For a given set, a classifier that gives rise to a larger margin will be better.
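A short sketch computing the margin of a toy set S (values invented) as the smallest per-sample distance:
```python
# Sketch: the margin of a sample is its distance to the plane; the margin of a
# set S is the smallest such distance over all x in S.
import numpy as np

w, w0 = np.array([1.0, 1.0]), 0.0
S = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, -3.0]])

margins = np.abs(S @ w + w0) / np.linalg.norm(w)
print(margins, margins.min())   # the set margin is the minimum sample margin
```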
Use Margins to Compare Solutions
➔Max margin ➔SVM
Linear Machines & SVM
Linear SVM: Linearly Separable Case
Construct SVM for Linearly Separable Data
Key Idea of Support Vector Machines
|For a given set, a classifier that gives rise to a larger margin will be better.
|SVM: To find the decision boundary such that the margin is maximized.
Formulating the Problem
|Given labeled training data: {(x(i), y(i)), i = 1, …, n}, with y(i) ∈ {+1, −1}
|Assuming the points are linearly separable, let’s write a separating hyperplane as:
H: wtx + b = 0
Formulating the Problem (cont’d)
|Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) examples.
|These define the planes H1 and H2.
|We can let d+ = d− = d
➔Find a solution maximizing 2d.
Formulating the Margin
|Given separating plane H: wtx + b = 0 and distance d, what are the equations for H1 and H2?
|Consider the plane H* given by wtx + b = ||w||d
-Check its orientation
-Check its distance to H
Formulating the Margin (cont’d)
|H1 is given by wtx + b = ||w||d
|Similarly, H2 is given by wtx + b = -||w||d
|Note: for any plane equation wtx + b = 0, {w, b} is defined only up to an unknown scale:
– {sw, sb} is also a valid solution to the equation, for any constant s.
Formulating the Margin (cont’d)
➔We can have the canonical formulation for all the planes as
H: wtx + b = 0 H1: wtx + b = 1 H2: wtx + b = -1
➔The region between H1 and H2 is also called the margin, and its width is 2/||w||.
Formulating SVM
(w*, b*) = argmin over (w, b) of ||w||, or equivalently
(w*, b*) = argmin over (w, b) of (1/2)||w||²
Subject to
wtx(i) + b ≥ 1 for y(i) = +1
wtx(i) + b ≤ −1 for y(i) = −1
The constraints can be combined into:
y(i)(wtx(i) + b) − 1 ≥ 0 ∀i
➔A nonlinear (quadratic) optimization problem with linear inequality constraints.
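In practice this quadratic program is usually handed to an off-the-shelf solver. A minimal sketch using scikit-learn's SVC with a very large C to approximate the hard-margin case (toy data invented):
```python
# Sketch: approximate the hard-margin linear SVM by using a very large C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

svm = SVC(kernel='linear', C=1e10).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
print(w, b)
print(y * (X @ w + b))   # should all be >= 1 (up to numerical tolerance)
```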
How to solve SVM? (Outline)
|Reformulate the problem using Lagrange multipliers α
-Lagrangian Primal Problem -Lagrangian Dual Problem
|The Karush-Kuhn-Tucker (KKT) conditions
-Necessary and sufficient conditions for w, b, α
-Solving the SVM problem ➔ finding a solution to the KKT conditions.
SVM: Lagrangian Primal Formulation
LP(w, b, α) = (1/2)||w||² − Σi αi [yi(wtxi + b) − 1]
|then the SVM solution should satisfy
∂LP/∂w = 0, ∂LP/∂b = 0, and
αi [yi(wtxi + b) − 1] = 0 ∀i
The final w is given by
w = Σi αi yi xi
and b is given by
b = yk − wtxk, for any k such that αk > 0
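A small numpy sketch of recovering w and b from the multipliers; the α values below are invented for illustration, not an actual dual solution:
```python
# Sketch: given alphas, recover w = sum_i alpha_i y_i x_i and
# b = y_k - w^t x_k for any support vector k (alpha_k > 0).
import numpy as np

# alpha, y, X are assumed to come from solving the dual problem (values illustrative).
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])        # made-up multipliers

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
k = np.argmax(alpha > 0)                         # index of one support vector
b = y[k] - w @ X[k]
print(w, b)
```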
SVM: Lagrangian Dual Formulation
|The objective function is
LD(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi ∙ xj)
|The solution is the same as before. But there is an important observation.
|Points for which αi > 0 are called support vectors
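For reference, scikit-learn exposes exactly these points after fitting; a brief sketch on the same kind of toy data as above:
```python
# Sketch: the support vectors are exactly the training points with alpha_i > 0.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

svm = SVC(kernel='linear', C=1e10).fit(X, y)
print(svm.support_)            # indices of the support vectors
print(svm.support_vectors_)    # the support vectors themselves
print(svm.dual_coef_)          # alpha_i * y_i for each support vector
```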
Linear Machines & SVM
SVM for Non-linearly-separable Case
Linear Separability Violated
|Some samples will always be misclassified no matter what {w,b} is used.
Examining Misclassified Samples
|They will violate the constraints:
wtx(i) + b ≥ 1 for y(i) = +1
wtx(i) + b ≤ −1 for y(i) = −1
Relaxing the Constraints
|Introducing non-negative slack variables ξi
wtx(i) + b ≥ 1 − ξi for y(i) = +1
wtx(i) + b ≤ −1 + ξi for y(i) = −1
|For an error to occur, the corresponding ξi must exceed unity.
-Hinge loss or soft margin.
➔∑i ξi provides an upper bound on the number of training errors.
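A minimal numpy sketch of the slack/hinge computation ξi = max(0, 1 − yi(wtxi + b)) on invented values, showing that ∑i ξi bounds the error count:
```python
# Sketch: slack xi_i = max(0, 1 - y_i (w^t x_i + b)); an error needs xi_i > 1,
# so sum_i xi_i upper-bounds the number of training errors. Values illustrative.
import numpy as np

w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[2.0, 2.0], [0.2, 0.1], [0.5, 0.3], [-1.0, -1.0]])
y = np.array([+1, +1, -1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
errors = np.sum(y * (X @ w + b) < 0)
print(xi, xi.sum(), errors)     # sum of slacks >= number of errors
```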
Updating the Formulation
|C is a parameter to control how much penalty is assigned to errors.
(w*, b*) = argmin over (w, b) of (1/2)||w||² + C Σi ξi
Subject to
wtx(i) + b ≥ 1 − ξi for y(i) = +1
wtx(i) + b ≤ −1 + ξi for y(i) = −1
ξi ≥ 0 ∀i
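A brief sketch of the effect of C using scikit-learn's soft-margin SVC on an invented, non-separable toy set:
```python
# Sketch: the soft-margin SVM trades margin width against slack via C.
# Larger C penalizes violations more heavily (data invented; the last +1 point
# cannot be separated from the -1 points by any line).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0], [-1.5, -1.5]])
y = np.array([+1, +1, -1, -1, +1])

for C in (0.1, 10.0):
    svm = SVC(kernel='linear', C=C).fit(X, y)
    w = svm.coef_[0]
    print(C, np.linalg.norm(w), svm.n_support_)
```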
Are Non-linear Decision Boundaries Possible?
|Transform data to higher dimensions using a mapping
-More freedom to position the samples
-May make the samples linearly separable
-Run linear SVM in the new space ➔ may correspond to non-linear boundaries in the original space
|What mapping to use?
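A tiny 1-D sketch (data invented) of such a mapping: Φ(x) = (x, x²) turns a class structure that is not linearly separable on the line into one that is separable in the mapped plane:
```python
# Sketch: an explicit quadratic feature map makes "inner vs. outer" 1-D data
# linearly separable in the new (x, x^2) space.
import numpy as np

x = np.array([-2.0, -1.5, -0.3, 0.2, 1.4, 2.1])
y = np.array([+1, +1, -1, -1, +1, +1])   # outer points vs. inner points

phi = np.column_stack([x, x ** 2])        # mapping Phi(x) = (x, x^2)
# In the mapped space the classes are separated by the line x^2 = 1,
# i.e., a linear boundary there, a non-linear one in the original space.
print(phi)
print(np.all((phi[:, 1] > 1) == (y > 0)))   # True: linearly separable after mapping
```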
The Kernel Trick
|Revisit the dual formulation for SVM:
LD(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi ∙ xj)
|Introduce a kernel function K(xi, xj) in place of the inner product:
LD(α) = Σi αi − (1/2) Σi,j αi αj yi yj K(xi, xj)
The Kernel Trick (cont’d)
|Mercer’s Theorem: for a symmetric, non-negative definite kernel function satisfying some minor conditions, there exists a mapping Φ(x) such that
𝐾(𝐱𝑖 ,𝐱𝑗 )=Φ(𝐱𝑖 )∙Φ(𝐱𝑗 )
➔Using a kernel function in LD effectively defines an implicit mapping to a higher-dimensional space, where the linear SVM is run.
➔The decision boundaries in the original space can be highly non-linear.
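A quick numerical check of this correspondence for the degree-2 polynomial kernel K(u, v) = (u ∙ v)², whose explicit map in 2-D is Φ(u) = (u1², √2·u1·u2, u2²) (a standard textbook example, not from the slides):
```python
# Sketch: verify K(u, v) = (u . v)^2 equals Phi(u) . Phi(v)
# with Phi(u) = (u1^2, sqrt(2) u1 u2, u2^2).
import numpy as np

def phi(u):
    return np.array([u[0] ** 2, np.sqrt(2) * u[0] * u[1], u[1] ** 2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((u @ v) ** 2, phi(u) @ phi(v))   # both equal 1.0
```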
Common Kernel Functions
|Polynomials of degree d: K(x(i), x(j)) = (x(i) ∙ x(j))^d
|Polynomials of degree up to d: K(x(i), x(j)) = (x(i) ∙ x(j) + 1)^d
|Gaussian kernels: K(x(i), x(j)) = exp(−||x(i) − x(j)||² / (2σ²))
|Sigmoid kernel: K(x(i), x(j)) = tanh(η x(i) ∙ x(j) + ν)
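Minimal numpy sketches of these kernels (parameter values d, σ, η, ν chosen arbitrarily for illustration):
```python
# Sketch implementations of the kernels listed above.
import numpy as np

def poly_kernel(u, v, d=3):
    return (u @ v) ** d                       # polynomial of degree d

def poly_kernel_up_to(u, v, d=3):
    return (u @ v + 1) ** d                   # polynomial of degree up to d

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta=0.5, nu=0.0):
    return np.tanh(eta * (u @ v) + nu)

u, v = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(poly_kernel(u, v), gaussian_kernel(u, v), sigmoid_kernel(u, v))
```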