Lecture 6 Support Vector Machine
Instructor:
Homework Assignments (60%)
Worksheets
Programming assignments
Project Assignment (15%)
Each team may have up to three students
Final Exam (25%)
Outline of this lecture
Classification: SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation/Case Study
SVM in Python (Binary Classification)
package sklearn (link)
import sklearn.svm as svm
model = svm.SVC(kernel='linear', C=c_value)
model.fit(X=x_train, y=y_train)
model.predict(X)
X: n_samples by n_features
return: n_samples by 1; binary values: +1 or -1 (a minimal usage sketch follows below)
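A minimal end-to-end sketch of the calls above; the toy training data and the choice C=1.0 are made up for illustration.

# Minimal sketch of the sklearn calls above; the tiny dataset is hypothetical.
import numpy as np
import sklearn.svm as svm

# Two small clusters labeled +1 and -1 (made-up data)
x_train = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y_train = np.array([1, 1, -1, -1])

model = svm.SVC(kernel='linear', C=1.0)   # C=1.0 is an arbitrary choice here
model.fit(X=x_train, y=y_train)

X_test = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(model.predict(X_test))              # array of +1 / -1 labels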
Outline of this lecture
Classification: SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation/Case Study
Problem to Solve
Let's approach the two-class classification problem in a direct way.
Consider a two-dimensional space (x1, x2). Let's try to find a line that separates the two classes, i.e., a separating line.
Separating hyperplane
In a 3-D space, to separate the classes, we need a plane.
In a 4-D or higher-dimensional space, we need a hyperplane.
Hyperplane: mathematical definitions
A hyperplane in p dimensions is a flat affine subspace of dimension p-1
In general, the equation for a hyperplane has the form w1 x1 + w2 x2 + ... + wp xp + b = 0, or compactly wT x + b = 0.
If p = 2, a hyperplane is a line;
If p = 3, a hyperplane is a plane;
If b = 0, the hyperplane goes through the origin; otherwise not.
The vector w = (w1, ..., wp) is called the normal vector.
It points in a direction orthogonal to the surface of the hyperplane (e.g., a line).
For any point x, if wT x + b = 0, then x lies on the hyperplane;
Separating Hyperplanes
Suppose we have a set of points in 2-D space; we would like to separate the classes.
Let f(x) = wT x + b. For any point x:
If f(x) = 0, the point lies right on the separating line;
If f(x) > 0, the point is on one side of the separating line, marked as blue;
If f(x) < 0, the point is on the other side, marked as red;
Quiz: Given the separating function f(x), which of the following data points do not belong to the same class as the others?
A: (0.8 0.9)
B: (0.1 0.5)
C: ( 2, 2)
A separating hyperplane satisfies y_i (wT x_i + b) > 0 for all points.
SVM: Linear Discriminant Function
g(x) is a linear discriminant function: g(x) = wT x + b,
where x is a feature vector of d dimensions, w is a vector of d parameters, and b is a bias scalar.
g(x) = 0 defines a hyperplane in the feature space.
The (unit-length) normal vector of the hyperplane is, by definition, n = w / ||w||.
wT x + b = 0
wT x + b < 0
wT x + b > 0
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Linear Discriminant Function
denotes +1
denotes -1
Infinite number of answers!
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Linear Discriminant Function
denotes +1
denotes -1
Infinite number of answers!
Which one is the best?
Linear Max-Margin Classifier
“safe zone”
The linear discriminant function (classifier) with the maximum margin is the best
Margin is defined as the width that the boundary could be increased by before hitting a data point
Why is it the best?
Robust to outliers and thus has strong generalization ability
denotes +1
denotes -1
Linear Max-Margin Classifier
Given a set of data points {(x_i, y_i)}, i = 1, 2, ..., n, with y_i in {+1, -1}: a separating hyperplane satisfies wT x_i + b > 0 for y_i = +1 and wT x_i + b < 0 for y_i = -1.
To obtain a safer margin, the above is equivalent (after rescaling w and b) to wT x_i + b ≥ 1 for y_i = +1 and wT x_i + b ≤ -1 for y_i = -1.
denotes +1
denotes -1
Linear Max-Margin Classifier
We know that wT x+ + b = 1 and wT x- + b = -1 for points x+ and x- lying on the two margin boundaries.
The margin width is: M = (x+ - x-) · (w / ||w||) = 2 / ||w||
denotes +1
denotes -1
wT x + b = 0
wT x + b = -1
wT x + b = 1
Support Vectors
Linear Max-Margin Classifier
Formulation: maximize the margin 2 / ||w||, subject to y_i (wT x_i + b) ≥ 1 for all training samples.
denotes +1
denotes -1
wT x + b = 0
wT x + b = -1
wT x + b = 1
Linear Max-Margin Classifier
Formulation (equivalent): minimize (1/2)||w||², subject to y_i (wT x_i + b) ≥ 1 for all i.
The above formulation is called Linear Hard-SVM (support vector machine), because it requires all samples to satisfy the constraints.
Linear Hard SVM
Given a set of training samples {(x_i, y_i)}, i = 1, ..., n, we aim to solve for a separating hyperplane (w, b) by
minimizing (1/2)||w||², subject to y_i (wT x_i + b) ≥ 1 for all i.
Challenge to Hard SVM
Linear Separate Hyper-plane does not always work!
Challenge 1/2: non-separable data
The data from the two classes are not separable by a linear boundary.
This is often the case unless the number of data points is less than the number of dimensions.
Challenge 2/2: noisy data
Sometimes the data are separable, but noisy.
A few noisy points can lead to a poor solution for the maximal-margin classifier.
Solution: soft margin
Relax the margin constraints on the training samples: y_i (wT x_i + b) ≥ 1 - ξ_i,
where ξ_i ≥ 0 is a slack variable;
SVM with soft margin: minimize (1/2)||w||² + C Σ_i ξ_i over (w, b, ξ),
subject to y_i (wT x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
C is a free parameter to be set empirically.
Unconstrained SVM formula
SVM with soft margin (unconstrained form): minimize (1/2)||w||² + C Σ_i max(0, 1 - y_i (wT x_i + b)).
For one training sample, the loss is the hinge loss max(0, 1 - y_i (wT x_i + b)).
If a point is outside the margin boundary (+1/-1), i.e., y_i (wT x_i + b) ≥ 1, there is no loss at all;
Otherwise, we still hope to minimize its distance to the margin boundary, i.e., ξ_i = 1 - y_i (wT x_i + b).
This unconstrained min-max based SVM formula:
Can be minimized by the gradient descent method;
Can also be solved efficiently by a quadratic programming toolbox (in its constrained form).
Regularization Parameter C
Most important parameter in SVM
Larger C: less tolerance of noise (margin violations are penalized more heavily);
Leads to a hyperplane with a smaller margin;
All (or almost all) training samples are classified correctly.
Smaller C: more tolerance of noise;
Leads to a hyperplane with a larger margin;
May misclassify more training samples.
Beyond Soft Margin
Can the magic parameter C solve all problems (i.e., non-separable or noisy data)?
No: sometimes there is no linear boundary that can separate the classes.
Solution 1: Feature Expansion
Enlarge the set of features, e.g., with polynomial transformations;
Fit a linear classifier in the enlarged feature space;
This results in a non-linear decision boundary in the original space.
Suppose we use the features (x1, x2, x1², x2², x1·x2); the hyperplane function would be of the form
b + w1·x1 + w2·x2 + w3·x1² + w4·x2² + w5·x1·x2 = 0,
which is a nonlinear function in the space (x1, x2).
Example: Polynomial Expansion
From (x1, x2) to (x1, x2, x1², x2², x1·x2); a sketch follows below.
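A minimal NumPy sketch of this expansion; the helper name poly_expand and the sample values are hypothetical.

import numpy as np

def poly_expand(X):
    # Expand (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(poly_expand(X))   # each row now has five expanded features

A linear SVM fit on these expanded features yields a quadratic decision boundary in the original (x1, x2) space.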
Solution 2: Nonlinearities
Project feature points from the original space into a nonlinear feature space so that the classes become separable.
(Introduced later.)
Outline of this lecture
Another perspective of SVM
Robust Linear SVM & Alternatives
Optimization (gradient descent)
Kernel SVM
Implementation
Another perspective of SVM
Optimization (gradient descent)
Kernel SVM
Implementation
Formulations
Objective function of SVM: minimize (1/2)||w||² + C Σ_{i=1}^{n} max(0, 1 - y_i (wT x_i + b)),
where C is a constant, n is the number of training samples, w is the vector of model parameters, and b is a bias term.
Optimization (optional)
Linear SVM Model
Gradient Descent Method
Initialize w;
Iterate until convergence {
compute the (sub-)gradient g = ∇F(w);
step along the negative gradient direction: w ← w - η·g;
}
Sub-gradients
Suppose F(w) is not differentiable everywhere.
Sub-gradient: a vector g such that f(y) ≥ f(x) + gT (y - x) for all y, i.e., it defines a linear lower bound on f that equals f at x.
For the hinge loss function max(0, 1 - y_i (wT x_i + b)):
If y_i (wT x_i + b) ≥ 1, the sub-gradient is 0 (no loss for a correct prediction);
Otherwise, the sub-gradient w.r.t. w is -y_i x_i (a sketch of sub-gradient descent follows below).
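A minimal sketch of sub-gradient descent on the unconstrained soft-margin objective above; the learning rate, iteration count, and synthetic data are assumptions for illustration, not the lecture's exact settings.

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, n_iters=1000):
    # Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y_i*(w^T x_i + b))) by sub-gradient descent
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                  # samples violating the margin contribute loss
        # Sub-gradient of the objective w.r.t. w and b
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical (roughly separable) data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = svm_subgradient_descent(X, y)
print(np.mean(np.sign(X @ w + b) == y))      # training accuracy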
Given a set of training samples {(x_i, y_i)}, i = 1, ..., n, we aim to solve for a hyperplane (w, b) to separate the classes:
minimize (1/2)||w||²,
subject to y_i (wT x_i + b) ≥ 1 for all i,
or equivalently minimize (1/2)||w||² + C Σ_i max(0, 1 - y_i (wT x_i + b)).
If a training sample obeys the constraint, there is no loss at all;
We hope to minimize the accumulated loss over all training samples.
Alternatives to SVM : Logistic Regressions
Logistic Regression and SVM
Both are designed for binary classification problems
One-vs-All to deal with multiple classes
Rewriting SVM's loss function:
Let θ = [w; b] and append a constant 1 to each feature vector, so that wT x + b = θT x.
Let z_i = y_i θT x_i; SVM then has the form: minimize (1/2)||θ||² + C Σ_i max(0, 1 - z_i).
Review: Logistic Regression
SVM vs. Logistic Regression: Loss
Logistic regression vs. SVMs:
When the classes are (nearly) separable, SVM does better than LR; otherwise, LR and SVM behave very similarly;
LR is better suited to estimating class probabilities;
To estimate nonlinear boundaries, kernel SVMs are popular; LR can be used with kernels as well, but the computations are more expensive (a loss-comparison sketch follows below).
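A small numerical sketch comparing the hinge loss and the logistic loss as functions of the margin z = y·f(x); the grid of z values is arbitrary.

import numpy as np

z = np.linspace(-2, 2, 9)                      # margin values y * f(x)
hinge = np.maximum(0.0, 1.0 - z)               # SVM hinge loss
logistic = np.log(1.0 + np.exp(-z))            # logistic (log) loss
for zi, h, l in zip(z, hinge, logistic):
    print(f"z={zi:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
# Both losses penalize small or negative margins; the hinge loss is exactly zero
# once z >= 1, while the logistic loss is always positive but decays toward zero.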
Outline of this lecture
Another perspective of SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation
Another perspective of SVM
Optimization (gradient descent)
Kernel SVM
Implementation
Kernel tricks for Nonlinear Boundary
Nonlinear boundary
e.g. Polynomial feature expansions
If the (expanded) discriminant function satisfies g(x) ≥ 0, predict Red for x;
Quiz: non-linear or linear?
Suppose there is only one labeled data point.
Classifier: samples whose distance v to this point satisfies v < b are positive samples; negative samples otherwise.
Nonlinear boundary
linear discriminant function
We have a linear discriminant function,
where , hyperplane function
Nonlinear boundary
linear discriminant function
linear discriminant function
A compact form
Nonlinear boundary
With two labeled points and the same criteria
For any feature point x, calculate its similarities to every landmark;
Solve a hyperplane boundary over these similarity features; if a point falls on the positive side,
then it is "+1"; otherwise, it is "-1";
linear discriminant function (two classes)
Idea: compare the distance of A to blue points and that to red points.
For a sample with a similarity vector (f1, f2, ...), if the discriminant value is positive,
then it is "+1"; otherwise, it is "-1";
An Intuitive Solution: similarities
To predict the color of A:
Suppose the hyperplane function is defined over the similarity features.
If a point is close to one of the red points, then the discriminant value is > 0;
If a point is far away from all three points, then the discriminant value is < 0.
Kernel: similarities
Describe a feature vector x by its similarities to a set of landmark feature vectors (l1, ..., lm);
Similarity vector: f = (K(x, l1), ..., K(x, lm));
If x and l are almost the same, K(x, l) ≈ 1;
If x and l are far from each other, K(x, l) ≈ 0.
Radial Basis Function (Gaussian) kernel: K(x, l) = exp(-||x - l||² / (2σ²)); a small sketch follows below.
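A small sketch of the similarity-vector idea with the RBF kernel above; the landmarks, the query point, and sigma are made-up values.

import numpy as np

def rbf_similarity(x, landmarks, sigma=1.0):
    # Map a point x to its RBF similarities to each landmark
    d2 = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])   # hypothetical landmarks
x = np.array([0.1, 0.1])
print(rbf_similarity(x, landmarks))   # close to 1 for nearby landmarks, near 0 for far ones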
Quiz: How to choose Landmarks?
Of the following three landmarks, which is the least informative?
Select landmark vectors that are close to the decision boundary;
Kernel: linear support model
Elegant and controlled way to introduce nonlinearities in support-vector classifiers – through the use of kernels
Linear model: f(x) = wT x + b.
Linear support model: f(x) = Σ_{i=1}^{n} α_i y_i (x_iT x) + b,
where n is the number of training samples;
most of the α_i are zero, except those of the support vectors;
SVM with Kernels
Linear SVM model: f(x) = Σ_i α_i y_i (x_iT x) + b.
Kernel SVM: f(x) = Σ_i α_i y_i K(x_i, x) + b.
Compute the Gram matrix: M(i, j) = K(x_i, x_j), where x_i indicates the i-th training sample (a precomputed-kernel sketch follows below).
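A sketch of computing the Gram matrix explicitly and passing it to sklearn's SVC with kernel='precomputed'; the synthetic data and gamma value are assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 2))
y_train = np.where(X_train[:, 0] * X_train[:, 1] > 0, 1, -1)

# Gram matrix M(i, j) = K(x_i, x_j)
gamma = 0.5
M = rbf_kernel(X_train, X_train, gamma=gamma)

model = SVC(kernel='precomputed', C=1.0)
model.fit(M, y_train)

# To predict, compute the kernel between test points and the *training* points
X_test = rng.normal(size=(5, 2))
M_test = rbf_kernel(X_test, X_train, gamma=gamma)
print(model.predict(M_test))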
Kernel SVM: Hyper-parameters
Larger C: fits the noise more (lower bias, higher variance);
Smaller C: smoother decision boundary (higher bias, lower variance).
Kernel types for K(x, x'):
Polynomial kernel: K(x, x') = (xT x' + c)^d;
Chi-square kernel;
Cosine kernel: K(x, x') = xT x' / (||x|| ||x'||);
Parameters for kernels, e.g., σ in the RBF kernel:
Large σ: the similarity features vary more smoothly;
Small σ: the similarity features vary less smoothly;
Outline of this lecture
Another perspective of SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation
Another perspective of SVM
Optimization (gradient descent)
Kernel SVM
Implementation
Python Modules
package sklearn (link)
import sklearn.svm as svm
model = svm.SVC(kernel='linear', C=c_value)
model.fit(X=x_train, y=y_train)
model.predict(X)
X: n_samples by n_features
return: n_samples by 1; binary values: +1 or -1
Case 1: SVM for generated data
Step 1: Generating samples of two categories
Case 1: SVM for generated data
Step 1: Generating samples of two categories
Step 2: Train a linear SVM model
class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True,
intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)
Case 1: SVM for generated data
Step 1: Generating samples of two categories
Step 2: Train a linear SVM model
Step 3: Visualize the linear discriminant function
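A possible sketch of Case 1's three steps; the data generator (make_blobs), its parameters, and the C value are assumptions, not the exact code from the slides.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Step 1: generate samples of two categories
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

# Step 2: train a linear SVM model
model = LinearSVC(C=1.0, max_iter=10000)
model.fit(X, y)

# Step 3: visualize the linear discriminant function wT x + b = 0
w, b = model.coef_[0], model.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.plot(xs, -(w[0] * xs + b) / w[1])   # solve w0*x0 + w1*x1 + b = 0 for x1
plt.show()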
Case 2: Linear SVM+Polynomial features
Step 1: Generate samples of two categories
Step 2: Train SVM over expanded features
Case 2: Linear SVM+Polynomial features
Step 1: Generate samples of two categories
Step 2: Train SVM over expanded features
Step 3: Visualize the model’s decision boundary
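A possible sketch of Case 2; the make_moons data, polynomial degree, and C value are assumptions, not the exact code from the slides.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Step 1: generate samples of two categories (not linearly separable)
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# Step 2: train a linear SVM over expanded (degree-3 polynomial) features
clf = make_pipeline(PolynomialFeatures(degree=3), LinearSVC(C=10.0, max_iter=10000))
clf.fit(X, y)

# Step 3: visualize the model's decision boundary on a grid
xx, yy = np.meshgrid(np.linspace(-1.5, 2.5, 300), np.linspace(-1.0, 1.5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()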
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel.
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel.
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel
Step-3: Highlight the support vectors
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel
Step-3: Highlight the support vectors
Step-4: Train and visualize SVM models with different C and Kernel parameters.
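A possible sketch of Case 3's four steps; the make_circles data and the specific C/gamma grid are assumptions, not the exact code from the slides.

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Step 1: load/generate a toy dataset of two classes
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# Step 2: train a kernel SVM with an RBF kernel
model = SVC(kernel='rbf', C=1.0, gamma=1.0)
model.fit(X, y)

# Step 3: highlight the support vectors
sv = model.support_vectors_
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(sv[:, 0], sv[:, 1], s=120, facecolors='none', edgecolors='k')
plt.show()

# Step 4: train models with different C and gamma and compare
for C in [0.1, 1.0, 100.0]:
    for gamma in [0.1, 1.0, 10.0]:
        acc = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y).score(X, y)
        print(f"C={C}, gamma={gamma}: train accuracy={acc:.2f}")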
Case 4: SVM for cancer data
Step-1: load and split the dataset
Case 4: SVM for cancer data
Step-1: load and split the dataset
Step-2: train a linear SVM model
Step-3: Evaluate on training and testing samples
Case 4: SVM for cancer data
Step-4: Train a SVM Model with processed data
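A possible sketch of Case 4, assuming the cancer data refers to sklearn's load_breast_cancer (the slides do not spell out the exact data source); the split ratio and C value are also assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Step 1: load and split the dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 2-3: train a linear SVM on the raw features and evaluate
raw_model = LinearSVC(C=1.0, max_iter=10000)
raw_model.fit(X_train, y_train)
print("raw features:    train", raw_model.score(X_train, y_train),
      " test", raw_model.score(X_test, y_test))

# Step 4: train an SVM with processed (standardized) features
scaled_model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
scaled_model.fit(X_train, y_train)
print("scaled features: train", scaled_model.score(X_train, y_train),
      " test", scaled_model.score(X_test, y_test))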
Case 1: SVM for generated data
Case 2: Linear SVM with Polynomial features
Case 3: Kernel SVM for Toy Data
Case 4: SVM for cancer data
(with/without data processing)
Conclusion
Another perspective of SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation
Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize (1/2)||w||², subject to y_i (wT x_i + b) ≥ 1 for all i.
Lagrangian:
L(w, b, α) = (1/2)||w||² - Σ_i α_i [y_i (wT x_i + b) - 1], with α_i ≥ 0.
Solving the Optimization Problem
Setting the derivatives of the Lagrangian to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
Lagrangian Dual:
maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_iT x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
Solving the Optimization Problem
The solution has the form: w = Σ_i α_i y_i x_i, and b = y_k - wT x_k for any support vector x_k (α_k > 0).
From the KKT conditions, we know: α_i [y_i (wT x_i + b) - 1] = 0 for all i.
Thus, only support vectors have nonzero α_i.
wT x + b = 0
wT x + b = -1
wT x + b = 1
Support Vectors
Solving the Optimization Problem
The linear discriminant function is: g(x) = Σ_{i in SV} α_i y_i (x_iT x) + b.
Notice it relies on a dot product between the test point x and the support vectors xi
Also keep in mind that solving the optimization problem involved computing the dot products xiTxj between all pairs of training points
Large Margin Linear Classifier
Formulation (Lagrangian Dual Problem): maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_iT x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
Large Margin Linear Classifier
What if the data are not linearly separable? (noisy data, outliers, etc.)
Slack variables ξi can be added to allow mis-classification of difficult or noisy data points
denotes +1
denotes -1
wT x + b = 0
wT x + b = -1
wT x + b = 1
Large Margin Linear Classifier
Formulation (soft margin): minimize (1/2)||w||² + C Σ_i ξ_i, subject to y_i (wT x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.
Parameter C can be viewed as a way to control over-fitting.