Lecture 6 Support Vector Machine
Instructor:
Homework Assignments (60%)
Worksheets
Programming assignments
Project Assignment (15%)
Each team may have up to three students
Final Exam (25%)
Outline of this lecture
Classification: SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation/Case Study
SVM in Python (Binary Classification)
package sklearn (link)
import sklearn.svm as svm
model = svm.SVC(kernel='linear', C=c_value)
model.fit(X=x_train, y=y_train)
model.predict(X)
X: n_samples by n_features
return: n_samples by 1; binary values: +1 or -1 (a minimal usage sketch follows below)
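A minimal end-to-end sketch of the calls above; the toy training data and the choice C=1.0 are made up for illustration.

# Minimal sketch of the sklearn calls above; the tiny dataset is hypothetical.
import numpy as np
import sklearn.svm as svm

# Two small clusters labeled +1 and -1 (made-up data)
x_train = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y_train = np.array([1, 1, -1, -1])

model = svm.SVC(kernel='linear', C=1.0)   # C=1.0 is an arbitrary choice here
model.fit(X=x_train, y=y_train)

X_test = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(model.predict(X_test))              # array of +1 / -1 labels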
Outline of this lecture
Classification: SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation/Case Study
Problem to Solve
Let's approach the two-class classification problem in a direct way.
Consider a two-dimensional space (x1, x2). Let's try to find a line that separates the two classes, i.e., a separating line.
Separating hyperplane
In a 3-D space, to separate the classes, we need a plane.
In a 4-D or higher-dimensional space, we need a hyperplane.
Hyperplane: mathematical definitions
A hyperplane in p dimensions is a flat affine subspace of dimension p-1
In general, the equation for a hyperplane has the form w1 x1 + w2 x2 + ... + wp xp + b = 0, or compactly wT x + b = 0.
If p = 2, a hyperplane is a line;
If p = 3, a hyperplane is a plane;
If b = 0, the hyperplane goes through the origin; otherwise not.
The vector w = (w1, ..., wp) is called the normal vector.
It points in a direction orthogonal to the surface of the hyperplane (e.g., a line).
For any point x, if wT x + b = 0, then x lies on the hyperplane;
Separating Hyperplanes
Suppose we have a set of points in 2-D space; we would like to separate the classes.
Let f(x) = wT x + b. For any point x:
If f(x) = 0, the point lies right on the separating line;
If f(x) > 0, the point is on one side of the separating line, marked as blue;
If f(x) < 0, the point is on the other side, marked as red;
Quiz: Given the separating function f(x), which of the following data points do not belong to the same class as the others?
A: (0.8 0.9)
B: (0.1 0.5)
C: ( 2, 2)
A separating hyperplane satisfies y_i (wT x_i + b) > 0 for all points.
SVM: Linear Discriminant Function
g(x) is a linear discriminant function: g(x) = wT x + b,
where x is a feature vector of d dimensions, w is a vector of d parameters, and b is a bias scalar.
g(x) = 0 defines a hyperplane in the feature space.
The (unit-length) normal vector of the hyperplane is, by definition, n = w / ||w||.
wT x + b = 0
wT x + b < 0
wT x + b > 0
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Linear Discriminant Function
denotes +1
denotes -1
Infinite number of answers!
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Linear Discriminant Function
denotes +1
denotes -1
Infinite number of answers!
Which one is the best?
Linear Max-Margin Classifier
“safe zone”
The linear discriminant function (classifier) with the maximum margin is the best
Margin is defined as the width that the boundary could be increased by before hitting a data point
Why is it the best?
Robust to outliers and thus has strong generalization ability
denotes +1
denotes -1
Linear Max-Margin Classifier
Given a set of data points {(x_i, y_i)}, i = 1, 2, ..., n, with y_i in {+1, -1}: a separating hyperplane satisfies wT x_i + b > 0 for y_i = +1 and wT x_i + b < 0 for y_i = -1.
To obtain a safer margin, the above is equivalent (after rescaling w and b) to wT x_i + b ≥ 1 for y_i = +1 and wT x_i + b ≤ -1 for y_i = -1.
denotes +1
denotes -1
Linear Max-Margin Classifier
We know that wT x+ + b = 1 and wT x- + b = -1 for points x+ and x- lying on the two margin boundaries.
The margin width is: M = (x+ - x-) · (w / ||w||) = 2 / ||w||
denotes +1
denotes -1
wT x + b = 0
wT x + b = -1
wT x + b = 1
Support Vectors
Linear Max-Margin Classifier
Formulation: maximize the margin 2 / ||w||, subject to y_i (wT x_i + b) ≥ 1 for all training samples.
denotes +1
denotes -1
wT x + b = 0
wT x + b = -1
wT x + b = 1
Linear Max-Margin Classifier
Formulation (equivalent): minimize (1/2)||w||², subject to y_i (wT x_i + b) ≥ 1 for all i.
The above formulation is called Linear Hard-SVM (support vector machine), because it requires all samples to satisfy the constraints.
Linear Hard SVM
Given a set of training samples {(x_i, y_i)}, i = 1, ..., n, we aim to solve for a separating hyperplane (w, b) by
minimizing (1/2)||w||², subject to y_i (wT x_i + b) ≥ 1 for all i.
Challenge to Hard SVM
Linear Separate Hyper-plane does not always work!
Challenge 1/2: non-separable data
The data from the two classes are not separable by a linear boundary.
This is often the case unless the number of data points is less than the number of dimensions.
Challenge 2/2: noisy data
Sometimes the data are separable, but noisy.
A few noisy points can lead to a poor solution for the maximal-margin classifier.
Solution: soft margin
Relax the margin constraints on the training samples: y_i (wT x_i + b) ≥ 1 - ξ_i,
where ξ_i ≥ 0 is a slack variable;
SVM with soft margin: minimize (1/2)||w||² + C Σ_i ξ_i over (w, b, ξ),
subject to y_i (wT x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
C is a free parameter to be set empirically.
Unconstrained SVM formula
SVM with soft margin (unconstrained form): minimize (1/2)||w||² + C Σ_i max(0, 1 - y_i (wT x_i + b)).
For one training sample, the loss is the hinge loss max(0, 1 - y_i (wT x_i + b)).
If a point is outside the margin boundary (+1/-1), i.e., y_i (wT x_i + b) ≥ 1, there is no loss at all;
Otherwise, we still hope to minimize its distance to the margin boundary, i.e., ξ_i = 1 - y_i (wT x_i + b).
This unconstrained min-max based SVM formula:
Can be minimized by the gradient descent method;
Can also be solved efficiently by a quadratic programming toolbox (in its constrained form).
Regularization Parameter C
Most important parameter in SVM
Larger C: less tolerance of noise (margin violations are penalized more heavily);
Leads to a hyperplane with a smaller margin;
All (or almost all) training samples are classified correctly.
Smaller C: more tolerance of noise;
Leads to a hyperplane with a larger margin;
May misclassify more training samples.
Beyond Soft Margin
Can the magic parameter C solve all problems (i.e., non-separable or noisy data)?
No: sometimes there is no linear boundary that can separate the classes.
Solution 1: Feature Expansion
Enlarge the set of features, e.g., with polynomial transformations;
Fit a linear classifier in the enlarged feature space;
This results in a non-linear decision boundary in the original space.
Suppose we use the features (x1, x2, x1², x2², x1·x2); the hyperplane function would be of the form
b + w1·x1 + w2·x2 + w3·x1² + w4·x2² + w5·x1·x2 = 0,
which is a nonlinear function in the space (x1, x2).
Example: Polynomial Expansion
From (x1, x2) to (x1, x2, x1², x2², x1·x2); a sketch follows below.
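A minimal NumPy sketch of this expansion; the helper name poly_expand and the sample values are hypothetical.

import numpy as np

def poly_expand(X):
    # Expand (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(poly_expand(X))   # each row now has five expanded features

A linear SVM fit on these expanded features yields a quadratic decision boundary in the original (x1, x2) space.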
Solution 2: Nonlinearities
Project feature points from the original space into a nonlinear feature space so that the classes become separable.
(Introduced later.)
Outline of this lecture
Another perspective of SVM
Robust Linear SVM & Alternatives
Optimization (gradient descent)
Kernel SVM
Implementation
Another perspective of SVM
Optimization (gradient descent)
Kernel SVM
Implementation
Formulations
Objective function of SVM: minimize (1/2)||w||² + C Σ_{i=1}^{n} max(0, 1 - y_i (wT x_i + b)),
where C is a constant, n is the number of training samples, w is the vector of model parameters, and b is a bias term.
Optimization (optional)
Linear SVM Model
Gradient Descent Method
Initialize w;
Iterate until convergence {
compute the (sub-)gradient g = ∇F(w);
step along the negative gradient direction: w ← w - η·g;
}
Sub-gradients
Suppose F(w) is not differentiable everywhere.
Sub-gradient: a vector g such that f(y) ≥ f(x) + gT (y - x) for all y, i.e., it defines a linear lower bound on f that equals f at x.
For the hinge loss function max(0, 1 - y_i (wT x_i + b)):
If y_i (wT x_i + b) ≥ 1, the sub-gradient is 0 (no loss for a correct prediction);
Otherwise, the sub-gradient w.r.t. w is -y_i x_i (a sketch of sub-gradient descent follows below).
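A minimal sketch of sub-gradient descent on the unconstrained soft-margin objective above; the learning rate, iteration count, and synthetic data are assumptions for illustration, not the lecture's exact settings.

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, n_iters=1000):
    # Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y_i*(w^T x_i + b))) by sub-gradient descent
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                  # samples violating the margin contribute loss
        # Sub-gradient of the objective w.r.t. w and b
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical (roughly separable) data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = svm_subgradient_descent(X, y)
print(np.mean(np.sign(X @ w + b) == y))      # training accuracy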
Given a set of training samples {(x_i, y_i)}, i = 1, ..., n, we aim to solve for a hyperplane (w, b) to separate the classes:
minimize (1/2)||w||²,
subject to y_i (wT x_i + b) ≥ 1 for all i,
or equivalently minimize (1/2)||w||² + C Σ_i max(0, 1 - y_i (wT x_i + b)).
If a training sample obeys the constraint, there is no loss at all;
We hope to minimize the accumulated loss over all training samples.
Alternatives to SVM : Logistic Regressions
Logistic Regression and SVM
Both are designed for binary classification problems
One-vs-All to deal with multiple classes
Rewriting SVM's loss function:
Let θ = [w; b] and append a constant 1 to each feature vector, so that wT x + b = θT x.
Let z_i = y_i θT x_i; SVM then has the form: minimize (1/2)||θ||² + C Σ_i max(0, 1 - z_i).
Review: Logistic Regression
SVM vs. Logistic Regression: Loss
Logistic regression vs. SVMs:
When the classes are (nearly) separable, SVM does better than LR; otherwise, LR and SVM behave very similarly;
LR is better suited to estimating class probabilities;
To estimate nonlinear boundaries, kernel SVMs are popular; LR can be used with kernels as well, but the computations are more expensive (a loss-comparison sketch follows below).
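A small numerical sketch comparing the hinge loss and the logistic loss as functions of the margin z = y·f(x); the grid of z values is arbitrary.

import numpy as np

z = np.linspace(-2, 2, 9)                      # margin values y * f(x)
hinge = np.maximum(0.0, 1.0 - z)               # SVM hinge loss
logistic = np.log(1.0 + np.exp(-z))            # logistic (log) loss
for zi, h, l in zip(z, hinge, logistic):
    print(f"z={zi:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
# Both losses penalize small or negative margins; the hinge loss is exactly zero
# once z >= 1, while the logistic loss is always positive but decays toward zero.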
Outline of this lecture
Another perspective of SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation
Another perspective of SVM
Optimization (gradient descent)
Kernel SVM
Implementation
Kernel tricks for Nonlinear Boundary
Nonlinear boundary
e.g. Polynomial feature expansions
If the (expanded) discriminant function satisfies g(x) ≥ 0, predict Red for x;
Quiz: non-linear or linear?
Suppose there is only one labeled data point.
Classifier: samples whose distance v to this point satisfies v < b are positive samples; negative samples otherwise.
Nonlinear boundary
linear discriminant function
We have a linear discriminant function,
where , hyperplane function
Nonlinear boundary
linear discriminant function
linear discriminant function
A compact form
Nonlinear boundary
With two labeled points and the same criteria
For any feature point x, calculate its similarities to every landmark;
Solve a hyperplane boundary over these similarity features; if a point falls on the positive side,
then it is "+1"; otherwise, it is "-1";
linear discriminant function (two classes)
Idea: compare the distance of A to blue points and that to red points.
For a sample with a similarity vector (f1, f2, ...), if the discriminant value is positive,
then it is "+1"; otherwise, it is "-1";
An Intuitive Solution: similarities
To predict the color of A:
Suppose the hyperplane function is defined over the similarity features.
If a point is close to one of the red points, then the discriminant value is > 0;
If a point is far away from all three points, then the discriminant value is < 0.
Kernel: similarities
Describe a feature vector x by its similarities to a set of landmark feature vectors (l1, ..., lm);
Similarity vector: f = (K(x, l1), ..., K(x, lm));
If x and l are almost the same, K(x, l) ≈ 1;
If x and l are far from each other, K(x, l) ≈ 0.
Radial Basis Function (Gaussian) kernel: K(x, l) = exp(-||x - l||² / (2σ²)); a small sketch follows below.
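A small sketch of the similarity-vector idea with the RBF kernel above; the landmarks, the query point, and sigma are made-up values.

import numpy as np

def rbf_similarity(x, landmarks, sigma=1.0):
    # Map a point x to its RBF similarities to each landmark
    d2 = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])   # hypothetical landmarks
x = np.array([0.1, 0.1])
print(rbf_similarity(x, landmarks))   # close to 1 for nearby landmarks, near 0 for far ones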
Quiz: How to choose Landmarks?
Of the following three landmarks, which is the least informative?
Select landmark vectors that are close to the decision boundary;
Kernel: linear support model
Elegant and controlled way to introduce nonlinearities in support-vector classifiers – through the use of kernels
Linear model: f(x) = wT x + b.
Linear support model: f(x) = Σ_{i=1}^{n} α_i y_i (x_iT x) + b,
where n is the number of training samples;
most of the α_i are zero, except those of the support vectors;
SVM with Kernels
Linear SVM model: f(x) = Σ_i α_i y_i (x_iT x) + b.
Kernel SVM: f(x) = Σ_i α_i y_i K(x_i, x) + b.
Compute the Gram matrix: M(i, j) = K(x_i, x_j), where x_i indicates the i-th training sample (a precomputed-kernel sketch follows below).
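A sketch of computing the Gram matrix explicitly and passing it to sklearn's SVC with kernel='precomputed'; the synthetic data and gamma value are assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 2))
y_train = np.where(X_train[:, 0] * X_train[:, 1] > 0, 1, -1)

# Gram matrix M(i, j) = K(x_i, x_j)
gamma = 0.5
M = rbf_kernel(X_train, X_train, gamma=gamma)

model = SVC(kernel='precomputed', C=1.0)
model.fit(M, y_train)

# To predict, compute the kernel between test points and the *training* points
X_test = rng.normal(size=(5, 2))
M_test = rbf_kernel(X_test, X_train, gamma=gamma)
print(model.predict(M_test))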
Kernel SVM: Hyper-parameters
Larger C: fits the noise more (lower bias, higher variance);
Smaller C: smoother decision boundary (higher bias, lower variance).
Kernel types for K(x, x'):
Polynomial kernel: K(x, x') = (xT x' + c)^d;
Chi-square kernel;
Cosine kernel: K(x, x') = xT x' / (||x|| ||x'||);
Parameters for kernels, e.g., σ in the RBF kernel:
Large σ: the similarity features vary more smoothly;
Small σ: the similarity features vary less smoothly;
Outline of this lecture
Another perspective of SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation
Another perspective of SVM
Optimization (gradient descent)
Kernel SVM
Implementation
Python Modules
package sklearn (link)
import sklearn.svm as svm
model = svm.SVC(kernel='linear', C=c_value)
model.fit(X=x_train, y=y_train)
model.predict(X)
X: n_samples by n_features
return: n_samples by 1; binary values: +1 or -1
Case 1: SVM for generated data
Step 1: Generating samples of two categories
Case 1: SVM for generated data
Step 1: Generating samples of two categories
Step 2: Train a linear SVM model
class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True,
intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)
Case 1: SVM for generated data
Step 1: Generating samples of two categories
Step 2: Train a linear SVM model
Step 3: Visualize the linear discriminant function
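A possible sketch of Case 1's three steps; the data generator (make_blobs), its parameters, and the C value are assumptions, not the exact code from the slides.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Step 1: generate samples of two categories
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

# Step 2: train a linear SVM model
model = LinearSVC(C=1.0, max_iter=10000)
model.fit(X, y)

# Step 3: visualize the linear discriminant function wT x + b = 0
w, b = model.coef_[0], model.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.plot(xs, -(w[0] * xs + b) / w[1])   # solve w0*x0 + w1*x1 + b = 0 for x1
plt.show()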
Case 2: Linear SVM+Polynomial features
Step 1: Generate samples of two categories
Step 2: Train SVM over expanded features
Case 2: Linear SVM+Polynomial features
Step 1: Generate samples of two categories
Step 2: Train SVM over expanded features
Step 3: Visualize the model’s decision boundary
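A possible sketch of Case 2; the make_moons data, polynomial degree, and C value are assumptions, not the exact code from the slides.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Step 1: generate samples of two categories (not linearly separable)
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# Step 2: train a linear SVM over expanded (degree-3 polynomial) features
clf = make_pipeline(PolynomialFeatures(degree=3), LinearSVC(C=10.0, max_iter=10000))
clf.fit(X, y)

# Step 3: visualize the model's decision boundary on a grid
xx, yy = np.meshgrid(np.linspace(-1.5, 2.5, 300), np.linspace(-1.0, 1.5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()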
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel.
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel.
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel
Step-3: Highlight the support vectors
Case 3: Kernel SVM for Toy Data
Step-1: Load a toy dataset of two classes
Step-2: Train a Kernel SVM with RBF kernel
Step-3: Highlight the support vectors
Step-4: Train and visualize SVM models with different C and Kernel parameters.
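A possible sketch of Case 3's four steps; the make_circles data and the specific C/gamma grid are assumptions, not the exact code from the slides.

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Step 1: load/generate a toy dataset of two classes
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# Step 2: train a kernel SVM with an RBF kernel
model = SVC(kernel='rbf', C=1.0, gamma=1.0)
model.fit(X, y)

# Step 3: highlight the support vectors
sv = model.support_vectors_
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(sv[:, 0], sv[:, 1], s=120, facecolors='none', edgecolors='k')
plt.show()

# Step 4: train models with different C and gamma and compare
for C in [0.1, 1.0, 100.0]:
    for gamma in [0.1, 1.0, 10.0]:
        acc = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y).score(X, y)
        print(f"C={C}, gamma={gamma}: train accuracy={acc:.2f}")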
Case 4: SVM for cancer data
Step-1: load and split the dataset
Case 4: SVM for cancer data
Step-1: load and split the dataset
Step-2: train a linear SVM model
Step-3: Evaluate on training and testing samples
Case 4: SVM for cancer data
Step-4: Train a SVM Model with processed data
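A possible sketch of Case 4, assuming the cancer data refers to sklearn's load_breast_cancer (the slides do not spell out the exact data source); the split ratio and C value are also assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Step 1: load and split the dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 2-3: train a linear SVM on the raw features and evaluate
raw_model = LinearSVC(C=1.0, max_iter=10000)
raw_model.fit(X_train, y_train)
print("raw features:    train", raw_model.score(X_train, y_train),
      " test", raw_model.score(X_test, y_test))

# Step 4: train an SVM with processed (standardized) features
scaled_model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
scaled_model.fit(X_train, y_train)
print("scaled features: train", scaled_model.score(X_train, y_train),
      " test", scaled_model.score(X_test, y_test))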
Case 1: SVM for generated data
Case 2: Linear SVM with Polynomial features
Case 3: Kernel SVM for Toy Data
Case 4: SVM for cancer data
(with/without data processing)
Conclusion
Another perspective of SVM
Optimization (gradient descent)
Robust Linear SVM & Alternatives
Kernel SVM
Implementation
Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize (1/2)||w||², subject to y_i (wT x_i + b) ≥ 1 for all i.
Lagrangian:
L(w, b, α) = (1/2)||w||² - Σ_i α_i [y_i (wT x_i + b) - 1], with α_i ≥ 0.
Solving the Optimization Problem
Setting the derivatives of the Lagrangian to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
Lagrangian Dual:
maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_iT x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
Solving the Optimization Problem
The solution has the form: w = Σ_i α_i y_i x_i, and b = y_k - wT x_k for any support vector x_k (α_k > 0).
From the KKT conditions, we know: α_i [y_i (wT x_i + b) - 1] = 0 for all i.
Thus, only support vectors have nonzero α_i.
wT x + b = 0
wT x + b = -1
wT x + b = 1
Support Vectors
Solving the Optimization Problem
The linear discriminant function is: g(x) = Σ_{i in SV} α_i y_i (x_iT x) + b.
Notice it relies on a dot product between the test point x and the support vectors xi
Also keep in mind that solving the optimization problem involved computing the dot products xiTxj between all pairs of training points
Large Margin Linear Classifier
Formulation (Lagrangian Dual Problem): maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_iT x_j), subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
Large Margin Linear Classifier
What if the data are not linearly separable? (noisy data, outliers, etc.)
Slack variables ξi can be added to allow mis-classification of difficult or noisy data points
denotes +1
denotes -1
wT x + b = 0
wT x + b = -1
wT x + b = 1
Large Margin Linear Classifier
Formulation (soft margin): minimize (1/2)||w||² + C Σ_i ξ_i, subject to y_i (wT x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.
Parameter C can be viewed as a way to control over-fitting.