Data Mining (EECS 4412)
Support Vector Machines
Parke Godfrey
EECS
Lassonde School of Engineering York University
Thanks to:
Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Classification: A Mathematical Mapping
n Classification: predicts categorical class labels
n E.g., personal homepage classification
n xi = (x1, x2, x3, …), yi = +1 or –1
n x1: # of occurrences of the word “homepage”
n x2: # of occurrences of the word “welcome”
n Mathematically, x ∈ X = ℝ^n, y ∈ Y = {+1, –1}
n We want to derive a function f: X → Y (sketched below)
n Linear Classification
n Binary classification problem
n Data above the red line belongs to class ‘x’
n Data below the red line belongs to class ‘o’
n Examples: SVM, Perceptron, Probabilistic Classifiers
[Figure: 2-D scatter plot of ‘x’ and ‘o’ points separated by a red line]
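As a concrete illustration of the mapping f: X → Y, here is a minimal sketch of a linear decision function in Python (the weights, bias, and word counts are assumed for illustration, not learned from data):

```python
import numpy as np

# A linear binary classifier f: R^n -> {+1, -1}.
def predict(w, b, x):
    """Return +1 if x lies on or above the hyperplane w.x + b = 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: x = (# of "homepage", # of "welcome") for a toy document.
w = np.array([0.8, 0.6])   # assumed weights
b = -1.0                   # assumed bias
print(predict(w, b, np.array([3, 2])))   # +1 (class 'x')
print(predict(w, b, np.array([0, 0])))   # -1 (class 'o')
```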
Discriminative Classifiers
n Advantages
n Prediction accuracy is generally high
n Compared to Bayesian methods, in general
n Robust: works even when training examples contain errors
n Fast evaluation of the learned target function
n Bayesian networks are normally slow to evaluate
n Criticism
n Long training time
n Difficult to understand the learned function (weights)
n Bayesian networks, by contrast, can easily be used for pattern discovery
n Not easy to incorporate domain knowledge
n For Bayesian methods this is easy, in the form of priors on the data or distributions
SVM—Support Vector Machines
n A relatively new classification method for both linear and nonlinear data
n It uses a nonlinear mapping to transform the original training data into a higher dimension
n With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)
n With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
n SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
SVM—History and Applications
n Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s
n Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
n Used for: classification and numeric prediction
n Applications:
n handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
SVM—General Philosophy
[Figure: two possible separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the training tuples that define the margin]
SVM—Margins and Support Vectors
SVM—When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi its associated class label
There are infinitely many lines (hyperplanes) that separate the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
SVM—Linearly Separable
n A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
n For 2-D data it can be written as
w0 + w1 x1 + w2 x2 = 0
n The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ +1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
n Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
n This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
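A minimal sketch of this formulation on toy data, assuming scikit-learn is available (the library, the toy points, and the large C used to approximate a hard margin are not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class +1 in the upper right, class -1 near the origin.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin formulation on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)

# Every training tuple satisfies yi (w . xi + b) >= 1 (up to numerical tolerance);
# the support vectors are the tuples where equality (approximately) holds.
print(y * (X @ w + b))
```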
Why Is SVM Effective on High Dimensional Data?
n The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
n The support vectors are the essential or critical training examples — they lie closest to the decision boundary (MMH)
n If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
n The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
n Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable
n Transform the original input data into a higher dimensional space (see the sketch below)
n Search for a linear separating hyperplane in the new space
[Figure: linearly inseparable 2-D data plotted on axes A1 and A2]
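A tiny illustration of the idea, with an assumed 1-D data set that becomes linearly separable after the mapping Φ(x) = (x, x²):

```python
import numpy as np

# 1-D data that is not linearly separable: class -1 sits between the class +1 points.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Map each point into 2-D with Phi(x) = (x, x^2); in the new space the classes
# are separated by the horizontal line (second coordinate) = 2.25.
X_new = np.column_stack([x, x ** 2])
print(X_new)
print((X_new[:, 1] >= 2.25).astype(int) * 2 - 1)   # recovers y
```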
SVM: Different Kernel functions
n Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj) (see the numeric check below)
n Typical kernel functions include the polynomial kernel K(Xi, Xj) = (Xi · Xj + 1)^h, the Gaussian radial basis function kernel K(Xi, Xj) = exp(–‖Xi – Xj‖² / 2σ²), and the sigmoid kernel K(Xi, Xj) = tanh(κ Xi · Xj – δ)
n SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
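A small numeric check of the identity K(Xi, Xj) = Φ(Xi) · Φ(Xj), here for a degree-2 polynomial kernel without the “+1” term (the explicit feature map Φ and the example vectors are assumptions for illustration):

```python
import numpy as np

# For 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x . z)^2 equals the
# dot product in the explicitly transformed space Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_kernel(x, z))        # kernel applied to the original data -> 16.0
print(np.dot(phi(x), phi(z)))   # dot product after the mapping -> 16.0
```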
Scaling SVM by Hierarchical Micro-Clustering
n SVM does not scale well with the number of data objects, in terms of both training time and memory usage
n H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD’03
n CB-SVM (Clustering-Based SVM)
n Given a limited amount of system resources (e.g., memory), maximize SVM performance in terms of accuracy and training speed
n Use micro-clustering to effectively reduce the number of points to be considered
n When deriving the support vectors, de-cluster the micro-clusters near the candidate support vectors to ensure high classification accuracy
CF-Tree: Hierarchical Micro-cluster
n Read the data set once, construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
n Micro-clustering: Hierarchical indexing structure
n provides finer samples closer to the boundary and coarser samples farther from the boundary
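A rough analogue of this summarization step, assuming scikit-learn’s Birch (which builds a CF-tree) as a stand-in for the hierarchical micro-clustering; the data and threshold are made up for illustration:

```python
import numpy as np
from sklearn.cluster import Birch

# Two Gaussian blobs of raw points, summarized by CF-tree subcluster centroids.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])

cf = Birch(threshold=0.8, branching_factor=50, n_clusters=None).fit(X)

# Each subcluster centroid summarizes many nearby points; an SVM can then be
# trained on these centroids instead of on all raw points.
print("points:", len(X), "-> micro-cluster centroids:", len(cf.subcluster_centers_))
```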
Selective Declustering: Ensure High Accuracy
n The CF-tree is a suitable base structure for selective declustering
n De-cluster only the clusters Ei such that
n Di – Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
n That is, de-cluster only the clusters whose subclusters could be support clusters of the boundary
n “Support cluster”: The cluster whose centroid is a support vector
CB-SVM Algorithm: Outline
n Construct two CF-trees from positive and negative data sets independently
n Need one scan of the data set
n Train an SVM from the centroids of the root entries
n De-cluster the entries near the boundary into the next level
n The child entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries
n Train an SVM again from the centroids of the entries in the training set
n Repeat until nothing is accumulated
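A simplified, runnable sketch of this outline, collapsed to two levels (micro-cluster centroids vs. the raw points inside a cluster) and a single de-clustering pass; the data, the thresholds, and the use of scikit-learn’s Birch and SVC are assumptions, not the paper’s implementation:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.svm import SVC

# Two Gaussian blobs stand in for the positive and negative data sets.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(4, 1, (500, 2))])
y = np.hstack([np.full(500, -1), np.full(500, +1)])

def micro_cluster(X_cls):
    """Summarize one class by Birch subcluster centroids (stand-in for a CF-tree)."""
    cf = Birch(threshold=1.0, n_clusters=None).fit(X_cls)
    return cf.subcluster_centers_, cf.predict(X_cls)  # centroids, point -> centroid index

# Train a first SVM on the centroids only.
summaries = {label: micro_cluster(X[y == label]) for label in (-1, +1)}
C_all = np.vstack([summaries[-1][0], summaries[+1][0]])
y_all = np.hstack([np.full(len(summaries[-1][0]), -1), np.full(len(summaries[+1][0]), +1)])
svm = SVC(kernel="linear", C=10.0).fit(C_all, y_all)

# De-cluster centroids near the boundary and retrain on their member points
# together with the remaining (far-away) centroids.
X2, y2 = [], []
for label in (-1, +1):
    centers, assign = summaries[label]
    near = np.abs(svm.decision_function(centers)) < 1.5          # candidate clusters
    members = X[y == label][np.isin(assign, np.where(near)[0])]  # fine-grained points
    X2 += [members, centers[~near]]
    y2 += [np.full(len(members), label), np.full(int(np.sum(~near)), label)]

svm_final = SVC(kernel="linear", C=10.0).fit(np.vstack(X2), np.hstack(y2))
print("points in final training set:", len(np.vstack(X2)), "of", len(X))
```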
Accuracy and Scalability on Synthetic Dataset
n Experiments on large synthetic data sets show better accuracy than random-sampling approaches and far better scalability than the original SVM algorithm
SVM vs. Neural Network
n Neural Network
n Nondeterministic algorithm
n Generalizes well but doesn’t have a strong mathematical foundation
n Can easily be trained in an incremental fashion
n To learn complex functions, use a multilayer perceptron (nontrivial)
n SVM
n Deterministic algorithm
n Nice generalization properties
n Harder to train: learned in batch mode using quadratic programming techniques
n Using kernels, can learn very complex functions
SVM Related Links
n SVM Website: http://www.kernel-machines.org/
n Representative implementations
n LIBSVM: an efficient implementation of SVM; supports multi-class classification, nu-SVM, and one-class SVM, and includes interfaces for Java, Python, etc.
n SVM-light: simpler, but its performance is not better than LIBSVM; supports only binary classification; written in C
n SVM-torch: another recent implementation also written in C
Linear Regression
n Linear regression: involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
n Method of least squares: estimates the best-fitting straight line
w1 = Σi=1..|D| (xi – x̄)(yi – ȳ) / Σi=1..|D| (xi – x̄)²
w0 = ȳ – w1 x̄
n Multiple linear regression: involves more than one predictor variable
n Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
n Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2
n Solvable by an extension of the least squares method or using SAS, S-Plus
n Many nonlinear functions can be transformed into the above
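A minimal sketch of the least-squares formulas above, on assumed toy data:

```python
import numpy as np

# Simple linear regression y = w0 + w1*x fit by the closed-form least-squares
# estimates; the data points are made up and roughly follow y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print("w1 =", w1, "w0 =", w0)   # slope close to 2, intercept close to 0
```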
Nonlinear Regression
n Some nonlinear models can be modeled by a polynomial function
n A polynomial regression model can be transformed into linear regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
convertible to linear with the new variables x2 = x² and x3 = x³:
y = w0 + w1 x + w2 x2 + w3 x3
n Other functions, such as power function, can also be transformed to linear model
n Some models are intractably nonlinear (e.g., a sum of exponential terms)
n Least-squares estimates may still be obtained, but through extensive computation on more complex formulae
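A short sketch of the variable transformation, fitting the cubic model by ordinary least squares on assumed noise-free data:

```python
import numpy as np

# Polynomial regression y = w0 + w1*x + w2*x^2 + w3*x^3 treated as linear
# regression on the new variables x2 = x^2 and x3 = x^3.
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.3 * x ** 3   # known coefficients, no noise

# Design matrix with columns [1, x, x^2, x^3]; ordinary least squares recovers w.
A = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # approximately [1.0, 2.0, -0.5, 0.3]
```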
Other Regression-Based Models
n Generalized linear model:
n Foundation on which linear regression can be applied to modeling categorical response variables
n Variance of y is a function of the mean value of y, not a constant
n Logistic regression: models the probability of some event occurring via a linear function of a set of predictor variables (linear on the log-odds scale; see the sketch below)
n Poisson regression: models the data that exhibit a Poisson distribution
n Log-linear models: (for categorical data)
n Approximate discrete multidimensional probability distributions
n Also useful for data compression and smoothing
n Regression trees and model trees
n Trees to predict continuous values rather than class labels
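A quick illustration of logistic regression (the toy data and the use of scikit-learn are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: the event (y = 1) becomes more likely as the single predictor x grows.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# The learned coefficients define a linear function of the predictor on the
# log-odds scale; predict_proba maps it back to a probability.
print(model.coef_, model.intercept_)
print(model.predict_proba([[2.25]]))   # probability near 0.5 at the midpoint
```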