
Linear Machines & SVM
Linear Machines

Define general linear classifiers


Revisiting Logistic Regression
|In Logistic Regression: given a training set of n labelled samples, we learn P(y|x) by assuming a logistic sigmoid function.
➔ We end up with a linear classifier.
➔ g(x) = wtx is called the discriminant function.
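A minimal sketch (not from the slides) of why thresholding the logistic model's P(y|x) at 0.5 is the same as thresholding the linear discriminant g(x) = wtx at 0; the weight vector and sample below are made-up illustrative values, with x assumed to carry a leading constant 1 so that w absorbs the bias term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights; x is assumed augmented with a leading 1 so w absorbs the bias.
w = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.8, 1.5])

g = w @ x                # discriminant function g(x) = w^t x
p = sigmoid(g)           # P(y=1 | x) under the logistic model

# Thresholding P(y=1|x) at 0.5 is equivalent to thresholding g(x) at 0:
print(p >= 0.5, g >= 0)  # the two tests always agree
```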

Linear Discriminant Functions
|In general, taking a discriminative approach, we can assume some form for the discriminant function that defines the classifier.
➔The learning task is to use the training samples to estimate the parameters of the classifier.

Linear Decision Boundaries
|Linear discriminant functions give rise to linear decision boundaries
➔linear classifiers or linear machines
|We will use both notations:
g(x) = wtx or g(x) = wtx + w0

Linear Machine for C>2 Classes
|We can define C linear discriminant functions: gi(x) = witx, i = 1, 2, …, C
|What is the decision rule for the classifier?
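The standard decision rule, assign x to the class i with the largest gi(x), can be sketched as follows; the weight matrix W and the test point are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical weight matrix with one row per class: g_i(x) = W[i] @ x.
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.2]])   # C = 3 classes, d = 2 features

def classify(x, W):
    scores = W @ x              # g_i(x) = w_i^t x for each class i
    return int(np.argmax(scores))  # pick the class with the largest g_i(x)

print(classify(np.array([0.4, 1.0]), W))  # index of the predicted class
```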

The Learning Task
|Finding wi, i = 1, 2, …, C
|Let’s use the 2-class case as an example
-For n samples x1, …, xn from two classes ω1 and ω2, if there exists a vector w such that g(x) = wtx classifies them all correctly
➔Finding w
| i.e., finding w such that
wtxi ≥ 0 for samples of ω1 and wtxi < 0 for samples of ω2

Linear Separability
|If we can find at least one vector w such that g(x) = wtx classifies all samples correctly
➔ We say the samples are linearly separable.
|An example of a set that is not linearly separable in 2-D: (figure omitted)

The Solution Region
|There may be many different weight vectors that can all be valid solutions for a given training set
➔The solution regions
|If the solution vector is not unique, which one is the best?

Solving for the Weight Vector
|Consider the following approach: finding a solution vector which optimizes some objective function.
➔ We may introduce additional constraints for a “good” solution
➔ Solving a constrained optimization problem.
|Theoretical tools: Lagrange or Karush-Kuhn-Tucker conditions.
|In practice: e.g., gradient-descent-based search

Gradient Descent Procedure
|Basic idea:
- Define a cost function J(w)
- Starting from an initial weight vector w(0)
- Update w by
w(k+1) = w(k) − η(k) ∇J(w(k))
|η > 0 is the learning rate.
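A minimal sketch of the update rule above on a toy cost; J(w) here is a made-up quadratic chosen only to keep the example self-contained, not the classifier's actual training criterion.

```python
import numpy as np

# Hypothetical cost J(w) = ||w - w_target||^2, used only so the example runs;
# in the classifier setting J would measure training error instead.
w_target = np.array([2.0, -1.0])

def J(w):
    return np.sum((w - w_target) ** 2)

def grad_J(w):
    return 2.0 * (w - w_target)

w = np.zeros(2)            # initial weight vector w(0)
eta = 0.1                  # learning rate eta > 0
for k in range(100):
    w = w - eta * grad_J(w)   # w(k+1) = w(k) - eta(k) * grad J(w(k))

print(w, J(w))             # w converges toward the minimizer of J
```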

Linear Machines & SVM
The Concept of Margins

Illustrate Margins in Classifier

Illustrating Linear Boundaries
|The decision boundary is given by the line g(x) = 0.
-To appreciate the geometric interpretation, we will write w0 explicitly, i.e., we have
g(x) = wtx + w0
|The normal vector of the decision line/plane is w.
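A one-line justification of why w is the normal vector (a standard argument, spelled out here rather than on the slide): for any two points x1, x2 on the boundary,

```latex
\[
\mathbf{w}^{t}\mathbf{x}_1 + w_0 = 0, \quad
\mathbf{w}^{t}\mathbf{x}_2 + w_0 = 0
\;\Rightarrow\;
\mathbf{w}^{t}(\mathbf{x}_1 - \mathbf{x}_2) = 0
\]
```

so w is orthogonal to every direction lying within the decision plane.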

Which one is better?
➔Consider the distances of the samples to the decision plane.

Distance to the Decision Plane
[Figure: a point x and its projection xp onto the decision plane wtx + w0 = 0]

Distance to the Decision Plane
g(x) = wtx + w0
|g(x) gives an algebraic measure of the distance from x to the decision plane.
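A standard derivation of that algebraic distance (assuming xp denotes the projection of x onto the plane, as in the figure): writing x = xp + r·w/||w|| with g(xp) = 0 gives

```latex
\[
\mathbf{x} = \mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}, \quad g(\mathbf{x}_p) = 0
\;\Rightarrow\;
g(\mathbf{x}) = \mathbf{w}^{t}\mathbf{x} + w_0 = r\,\|\mathbf{w}\|
\;\Rightarrow\;
r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}
\]
```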

The Concept of Margins
|Let g(x) = 0 be a decision plane
-The margin of a sample x (w.r.t. the decision plane) is the
distance from x to the plane.
– For a given set of samples S, the margin (w.r.t a decision
plane) is the smallest margin over all x in S.
|For a given set, a classifier that gives rise to a larger margin will be better.

Use Margins to Compare Solutions
➔Max margin ➔SVM

Linear Machines & SVM
Linear SVM: Linearly Separable Case

Construct SVM for Linearly Separable Data

Key Idea of Support Vector Machines
|For a given set, a classifier that gives rise to a larger margin will be better.
|SVM: To find the decision boundary such that the margin is maximized.

Formulating the Problem
|Given labeled training data:
{(x(i), y(i))}, y(i) ∈ {-1, 1}, x(i) ∈ Rd, i = 1, …, n
|Assuming the points are linearly separable, let’s write a separating hyperplane as:
H: wtx + b = 0

Formulating the Problem (cont’d)
|Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) examples.
|These define planes H1 and H2.
|We can let d+=d-=d
➔Find a solution maximizing 2d.

Formulating the Margin
|Given separating plane H: wtx + b = 0 and distance d, what are the equations for H1 and H2?
|Consider the plane H* given by wtx + b = ||w||d
-Check its orientation
-Check its distance to H

Formulating the Margin (cont’d)
|H1 is given by wtx + b = ||w||d
|Similarly, H2 is given by wtx + b = -||w||d
|Note: for any plane equation wtx + b = 0, {w, b} is defined only up to an unknown scale:
– {sw, sb} is also a valid solution to the equation, for any constant s.

Formulating the Margin (cont’d)
➔We can have the canonical formulation for all the planes as
H: wtx + b = 0 H1: wtx + b = 1 H2: wtx + b = -1
➔The region between H1 and H2 is also called the margin, and its width is 2/||w||
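A short check of that width, using the point-to-plane distance from earlier in the section: the distance from any point on H1 to H is |1|/||w||, and likewise for H2, so

```latex
\[
\text{width} = \frac{|{+1}|}{\|\mathbf{w}\|} + \frac{|{-1}|}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}
\]
```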

Formulating SVM
w*, b* = argmin(w,b) ||w||,  or equivalently  w*, b* = argmin(w,b) (1/2)||w||²
Subject to
wtx(i) + b ≥ 1  for y(i) = +1
wtx(i) + b ≤ -1 for y(i) = -1
The constraints can be combined into:
y(i)(wtx(i) + b) − 1 ≥ 0  ∀i
➔A nonlinear (quadratic) optimization problem with linear inequality constraints.
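A minimal numerical sketch of this formulation, assuming scikit-learn is available; a very large C in SVC approximates the hard-margin ("no slack") problem, and the toy data below is made up.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],     # class +1
              [0.0, 0.0], [0.5, -1.0], [-1.0, 0.5]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e10)   # near hard-margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
# Check the constraints y_i (w^t x_i + b) >= 1 (up to numerical tolerance):
print(y * (X @ w + b))
```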

How to solve SVM? (Outline)
|Reformulate the problem using Lagrange multipliers α
-Lagrangian Primal Problem -Lagrangian Dual Problem
|The Karush-Kuhn-Tucker (KKT) conditions
-Necessary and sufficient for w, b, α.
-Solving the SVM problem➔finding a solution to the KKT conditions.

SVM: Lagrangian Primal Formulation
LP(w, b, α) = (1/2)||w||² − Σi αi [yi(wtxi + b) − 1]
|then the SVM solution should satisfy
∂LP/∂w = 0,  ∂LP/∂b = 0,
αi [yi(wtxi + b) − 1] = 0,  ∀i
The final w is given by
w = Σi αi yi xi
and b is given by
b = yk − wtxk,  for any k such that αk > 0
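A sketch of recovering w and b from the multipliers α, assuming scikit-learn; its dual_coef_ attribute stores αi·yi for the support vectors, and the toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel='linear', C=1e10).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors (alpha_i > 0).
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

w = alpha_y @ sv                       # w = sum_i alpha_i y_i x_i
k = 0                                  # any support vector index (alpha_k > 0)
y_k = y[clf.support_[k]]
b = y_k - w @ sv[k]                    # from y_k (w^t x_k + b) = 1, y_k in {+1,-1}

print(w, b)
print(clf.coef_[0], clf.intercept_[0])  # should closely match
```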

SVM: Lagrangian Dual Formulation
|The objective function is
LD(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi ∙ xj)
|The solution is the same as before. But there is an important observation.
|Points for which αi > 0 are called support vectors
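A sketch of this observation, assuming scikit-learn: the decision function can be evaluated using only the points with αi > 0 (the support vectors); the toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [4.0, 2.5],
              [0.0, 0.0], [1.0, 0.0], [-1.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e10).fit(X, y)
print("support vectors:", clf.support_vectors_)

x_new = np.array([1.5, 1.5])
# f(x) = sum_{i in SV} alpha_i y_i (x_i . x) + b, i.e. only support vectors matter
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(f, clf.decision_function([x_new])[0])   # the two values agree
```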

Linear Machines & SVM
SVM for Non-linearly-separable Case

Linear Separability Violated
|Some samples will always be misclassified no matter what {w,b} is used.

Examining Misclassified Samples
|They will violate the constraints:
wtx(i) + b ≥ 1  for y(i) = +1
wtx(i) + b ≤ -1 for y(i) = -1

Relaxing the Constraints
|Introducing non-negative slack variables ξi
wtx(i) + b ≥ 1 − ξi  for y(i) = +1
wtx(i) + b ≤ -1 + ξi for y(i) = -1
|For an error to occur, the corresponding ξi must exceed unity.
-Hinge loss or soft margin.
➔∑i ξi provides an upper bound on the number of training errors.

Updating the Formulation
|C is a parameter to control how much penalty is assigned to errors.
w*, b* = argmin(w,b) (1/2)||w||² + C(Σi ξi)
Subject to
wtx(i) + b ≥ 1 − ξi  for y(i) = +1
wtx(i) + b ≤ -1 + ξi for y(i) = -1
ξi ≥ 0, ∀i
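A sketch of the role of C, assuming scikit-learn: on made-up, slightly overlapping data, a small C tolerates more slack (wider margin, more violations), while a large C penalizes violations heavily.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, slightly overlapping two-class data.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=1.0, size=(30, 2))
X_neg = rng.normal(loc=[0, 0], scale=1.0, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 30 + [-1] * 30)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])   # 2 / ||w||
    print(f"C={C:>6}: margin width = {margin_width:.3f}, "
          f"#support vectors = {len(clf.support_)}")
```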

Are Non-linear Decision Boundaries Possible?
|Transform data to higher dimensions using a mapping
-More freedom to position the samples
-May make the samples linearly separable
-Run linear SVM in the new space➔may be equivalent to non-linear boundaries in the original space
|What mapping to use?
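A tiny illustration of the idea: a 1-D set that no single threshold can separate becomes linearly separable after a hand-chosen mapping φ(x) = (x, x²); both the data and the mapping are made up for illustration.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([+1,   +1,   -1,   -1,  -1,  +1,  +1])   # labels follow |x| >= 1

# In 1-D no threshold separates the classes, but in the mapped 2-D space
# the plane x^2 = 1 (i.e., w = (0, 1), b = -1) separates them:
phi = np.column_stack([x, x ** 2])
w, b = np.array([0.0, 1.0]), -1.0
print(np.sign(phi @ w + b) == y)       # all True: linearly separable in 2-D
```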

The Kernel Trick
|Revisit the dual formulation for SVM:
LD(α) = Σi αi − (1/2) Σi,j αi αj yi yj (xi ∙ xj)
|Introduce a kernel function:
LD(α) = Σi αi − (1/2) Σi,j αi αj yi yj K(xi, xj)

The Kernel Trick (cont’d)
|Mercer’s Theorem: for a symmetric, non-negative definite kernel function satisfying some minor conditions, there exists a mapping Φ(x) such that
𝐾(𝐱𝑖 ,𝐱𝑗 )=Φ(𝐱𝑖 )∙Φ(𝐱𝑗 )
➔Using a kernel function in LD effectively defines an implicit mapping to a higher-dimensional space, where linear SVM is run.
➔The decision boundaries in the original space can be highly non-linear.
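A quick numerical check of the Mercer relation for one known case: the degree-2 polynomial kernel K(x, z) = (x∙z)² in 2-D corresponds to the explicit mapping Φ(x) = (x1², √2·x1x2, x2²).

```python
import numpy as np

def K(x, z):
    # Degree-2 polynomial kernel
    return (x @ z) ** 2

def phi(x):
    # Known explicit mapping for this kernel in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))   # identical values: K(x, z) = phi(x) . phi(z)
```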

Common Kernel Functions
|Polynomials of degree d:
K(x(i), x(j)) = (x(i) ∙ x(j))^d
|Polynomials of degree up to d:
K(x(i), x(j)) = (x(i) ∙ x(j) + 1)^d
|Gaussian kernels:
K(x(i), x(j)) = exp(−||x(i) − x(j)||² / (2σ²))
|Sigmoid kernel:
K(x(i), x(j)) = tanh(η (x(i) ∙ x(j)) + ν)
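Minimal implementations of the kernels above (a sketch; σ, η, ν are user-chosen hyperparameters, and the 2σ² scaling in the Gaussian kernel is one common convention).

```python
import numpy as np

def poly_kernel(x, z, d):
    # Polynomial of degree d
    return (x @ z) ** d

def poly_up_to_d_kernel(x, z, d):
    # Polynomial of degree up to d
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma):
    # Gaussian (RBF) kernel with bandwidth sigma
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, eta, nu):
    # Sigmoid kernel with slope eta and offset nu
    return np.tanh(eta * (x @ z) + nu)

x, z = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(poly_kernel(x, z, 2), poly_up_to_d_kernel(x, z, 2),
      gaussian_kernel(x, z, 1.0), sigmoid_kernel(x, z, 0.5, 0.0))
```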
