Lecture 9: Large Scale
Support Vector Machines
1
CMSC5741 Big Data Tech. & Apps.
Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
Motivation
Introduce the widely used classification tool: Support Vector Machine (SVM)
Understand the model and its parameter estimation method in the big data setting
2
Motivation
3
Motivation
4
What if there are millions of photos? How do we make SVM training scalable?
Outline
Support Vector Machines
History
Linearly Separable SVMs
Non-linearly Separable SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
5
Outline
Support Vector Machines
History
Linearly Separable SVMs
Non-linearly Separable SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
6
SVMs: History
SVMs were introduced in COLT-92 by Boser, Guyon & Vapnik, and have become rather popular since.
Theoretically well motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s.
Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, . . . )
7
SVMs: History
Centralized website: www.kernel-machines.org.
Several textbooks, e.g. “An Introduction to Support Vector Machines” by Cristianini and Shawe-Taylor.
A large and diverse community works on them: from machine learning, optimization, statistics, neural networks, functional analysis, etc.
8
Outline
Support Vector Machines
History
Linear SVMs
Non-linear SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
9
Linear SVMs
10
Data
Training examples: {(x_i, y_i)}, i = 1, …, n
Each x_i ∈ R^d is a feature vector with label y_i ∈ {−1, +1}
Want to find a hyperplane w·x + b = 0
to separate “+” from “-”
What’s the best hyperplane defined by w and b?
Largest Margin
Distance from the separating hyperplane corresponds to the “confidence” of the prediction
Example: we are more confident that A and B belong to “+” than that C does
11
Largest Margin
Support vectors: the examples closest to the hyperplane
Margin ρ: the width of separation between the support vectors of the two classes
12
[Figure: separating hyperplane w·x + b = 0 with margin ρ; a point x at distance r from its projection x′ onto the hyperplane, along the normal direction w; support vectors highlighted]
Largest Margin
Distance from an example x to the separator is r = y(w·x + b) / ||w||
Proof: let x′ be the point on the hyperplane closest to x, i.e., the projection of x. Then x = x′ + y·r·w/||w||, and since x′ lies on the hyperplane, w·x′ + b = 0. Substituting and solving for r gives r = y(w·x + b)/||w||.
13
||w|| is the length of the vector w, i.e., its L2 norm: the square root of the inner product of the vector with itself.
Largest Margin
Assume that all data points are at distance at least 1 from the hyperplane; then the following constraints hold for the training set: y_i(w·x_i + b) ≥ 1
For support vectors, the inequality becomes an equality: y_i(w·x_i + b) = 1
Recall that the distance from x to the hyperplane is r = y(w·x + b)/||w||
So the margin is: ρ = 2/||w||
14
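To make the formula concrete, here is a tiny R sketch (illustrative only, not from the slides; the values of w, b, x, y are made up):
# Distance of a labeled example (x, y) from the hyperplane w . x + b = 0
w <- c(3, 4); b <- -5          # hyperplane parameters (illustrative)
x <- c(2, 1); y <- 1           # a "+" example
r <- y * (sum(w * x) + b) / sqrt(sum(w^2))   # r = y(w . x + b) / ||w||
r                              # 1: this point sits exactly on the margin boundary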
Linear SVMs
Note that here we assume all data points can be linearly separated by the hyperplane.
The margin is invariant to scaling of the parameters,
i.e., changing w, b to 5w, 5b does not change the hyperplane or the margin.
15
Linear SVMs
Maximize the margin ρ
Good according to intuition, theory (VC dimension) & practice
The problem of linear SVMs is formulated as: max_{w,b} ρ = 2/||w|| subject to y_i(w·x_i + b) ≥ 1 for all i
An equivalent form is: min_{w,b} (1/2)||w||² subject to y_i(w·x_i + b) ≥ 1 for all i
16
The VC dimension (for Vapnik–Chervonenkis dimension) (Vapnik and Chervonenkis (1968, 1971), Vapnik (1979)) measures the capacity of a hypothesis space. Capacity is a measure of complexity and measures the expressive power, richness or flexibility of a set of functions by assessing how wiggly its members can be. Sewell (2006)
“Let us say we have a dataset containing N points. These N points can be labeled in 2^N ways as positive and negative. Therefore, 2^N different learning problems can be defined by N data points. If for any of these problems, we can find a hypothesis h ∈ H that separates the positive examples from the negative, then we say H shatters N points. That is, any learning problem definable by N examples can be learned with no error by a hypothesis drawn from H. The maximum number of points that can be shattered by H is called the Vapnik–Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the capacity of the hypothesis class H.”
Outline
Support Vector Machines
History
Linearly Separable SVMs
Non-linearly Separable SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
17
Non-linearly Separable SVMs
In reality, training samples are usually not linearly separable.
Soft Margin Classification
Idea: allow errors, but introduce slack variables ξ_i to penalize them
Still try to minimize training set errors, and to place hyperplane “far” from each class (large margin)
18
Greek letter ξ: xi
Soft Margin Classification
The problem becomes: min_{w,b,ξ} (1/2)||w||² + C·Σ_i ξ_i subject to y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i
Minimize (1/2)||w||² plus C times the total slack (a proxy for the number of training mistakes)
Set C using cross validation
19
Soft Margin Classification
If a point x_i is on the wrong side of the margin, it incurs a penalty proportional to its slack ξ_i, i.e., how far past the margin it lies
Thus not all mistakes are equally bad!
20
Slack Penalty C
What is the role of the penalty C?
C small: we can set the slacks ξ_i to anything, so w = 0 is optimal (basically ignore the data)
C large: slack is heavily penalized, so we only want w, b that separate the data (approaching the hard margin)
21
Soft Margin Classification
SVM in the “natural” (unconstrained) form: arg min_{w,b} (1/2)||w||² + C·Σ_i max(0, 1 − y_i(w·x_i + b))
SVM uses the “hinge loss”: L(x_i, y_i) = max(0, 1 − y_i(w·x_i + b))
22
In this form, (1/2)||w||² is the margin (regularization) term, the sum of hinge losses is the empirical loss L, and C is the regularization parameter trading the two off.
The SVM in the “natural” form with the hinge loss can be solved because the objective function is convex. Optimizing the 0/1 loss directly has 2^n possibilities, which makes it an NP-hard problem.
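To make the objective concrete, a minimal R sketch (not part of the lecture; X, y, w, b, C below are illustrative toy values) that evaluates the soft-margin objective (1/2)||w||² + C·Σ_i max(0, 1 − y_i(w·x_i + b)):
# Hinge loss for each example: max(0, 1 - y_i * (w . x_i + b))
hinge_loss <- function(X, y, w, b) {
  pmax(0, 1 - y * (X %*% w + b))
}
# Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses
svm_objective <- function(X, y, w, b, C) {
  0.5 * sum(w^2) + C * sum(hinge_loss(X, y, w, b))
}
# Toy example: 4 points in 2-D with labels +1 / -1
X <- matrix(c(2, 2, 3, 3, 0, 0, 1, 0), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
svm_objective(X, y, w = c(1, 1), b = -3, C = 1)   # all hinge losses are 0 here, so the value is 0.5*||w||^2 = 1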
In-class Practice
Go to practice
23
Outline
Support Vector Machines
History
Linear SVMs
Non-linear SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
24
Non-linearly Separable SVMs
Linear classifiers aren’t complex enough sometimes.
Map data into a richer feature space including non-linear features
Then construct a hyperplane in that space; all the other equations stay the same
25
Non-linearly Separable SVMs
Formally, preprocess the data with a feature map: x ↦ φ(x)
Then learn the map from φ(x) to y
26
Example: Polynomial Mapping
27
[Figure: 1-D data on the x-axis that is not linearly separable becomes separable after the mapping x ↦ (x, x²)]
Example: Polynomial Mapping
28
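A small R sketch (illustrative, not from the slides) of the degree-2 polynomial feature map φ(x₁, x₂) = (x₁², √2·x₁·x₂, x₂²); the points x and z are made up, and the check shows that the inner product in the mapped space equals the polynomial kernel (x·z)²:
# Degree-2 polynomial feature map: R^2 -> R^3
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
# In the mapped space, the inner product equals the polynomial kernel (x . z)^2
x <- c(2, 3); z <- c(1, 1)
sum(phi(x) * phi(z))    # 25
(sum(x * z))^2          # 25 as well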
Example: MNIST
Data: 60,000 training examples, 10,000 test examples, 28×28-pixel images
A linear SVM has around 8.5% test error; a polynomial SVM has around 1% test error.
29
MNIST Results
Choosing a good mapping (encoding prior knowledge + getting right complexity of function class) for your problem improves results.
30
SVMs: Kernel Trick
The Representer Theorem (Kimeldorf & Wahba, 1971) shows that (for SVMs as a special case): w = Σ_i α_i φ(x_i)
for some variables α_i. Instead of optimizing w directly, we can optimize the α_i.
The decision rule is: f(x) = Σ_i α_i K(x_i, x) + b
We call K(x_i, x) = φ(x_i)·φ(x) the kernel function.
31
Greek letters: φ (phi), α (alpha)
Kernels
Why kernels?
Make non-separable problem separable.
Map data into better representational space
Commonly used kernels
Linear: K(x, z) = x·z
Polynomial: K(x, z) = (1 + x·z)^d
Gives feature conjunctions
Radial basis function: K(x, z) = exp(−||x − z||² / (2σ²))
32
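For illustration (not part of the lecture), the three kernels above written as small R functions; the degree d and width sigma are assumed example parameters:
# Linear kernel: plain inner product
linear_kernel <- function(x, z) sum(x * z)
# Polynomial kernel of degree d: (1 + x . z)^d
poly_kernel <- function(x, z, d = 2) (1 + sum(x * z))^d
# Radial basis function (Gaussian) kernel with width sigma
rbf_kernel <- function(x, z, sigma = 1) exp(-sum((x - z)^2) / (2 * sigma^2))
x <- c(1, 2); z <- c(2, 0)
c(linear = linear_kernel(x, z),
  poly = poly_kernel(x, z, d = 2),
  rbf = rbf_kernel(x, z, sigma = 1))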
Outline
Support Vector Machines
History
Linearly Separable SVMs
Non-linearly Separable SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
33
SVM: How to Estimate w, b
We take soft-margin classification as the example: min_{w,b,ξ} (1/2)||w||² + C·Σ_i ξ_i subject to y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0
Standard way: Use a solver!
Solver: software for finding solutions to “common” optimization problems, e.g. LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
Problem: solvers are inefficient for big data!
34
SVM: How to Estimate w, b
Want to estimate w, b!
Alternative approach:
Want to minimize f(w,b) = (1/2)||w||² + C·Σ_i max(0, 1 − y_i(w·x_i + b))
How do we minimize a convex function f(z)?
Use gradient descent:
Iterate: z ← z − η·∇f(z), where η is the learning rate (step size)
35
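For intuition, an illustrative R sketch (not from the slides) of gradient descent on a simple convex function f(z) = (z − 3)²:
# Minimize f(z) = (z - 3)^2 by gradient descent
grad_f <- function(z) 2 * (z - 3)      # derivative of f
z <- 0                                 # starting point
eta <- 0.1                             # learning rate (step size)
for (t in 1:100) {
  z <- z - eta * grad_f(z)             # iterate: z <- z - eta * grad f(z)
}
z                                      # converges to about 3, the minimizer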
SVM: How to Estimate w?
Want to minimize f(w,b) = (1/2)||w||² + C·Σ_i L(x_i, y_i), where the empirical loss is L(x_i, y_i) = max(0, 1 − y_i(w·x_i + b))
Compute the gradient with respect to each coordinate w^(j):
∇_j f = ∂f(w,b)/∂w^(j) = w^(j) + C·Σ_i ∂L(x_i, y_i)/∂w^(j)
∂L(x_i, y_i)/∂w^(j) = 0 if y_i(w·x_i + b) ≥ 1, and −y_i·x_i^(j) otherwise
36
SVM: How to Estimate w?
Gradient descent: iterate until convergence:
For j = 1, …, d: evaluate ∇_j f, then update w^(j) ← w^(j) − η·∇_j f (η … learning rate)
Problem:
Computing ∇_j f takes O(n) time
n … size of the training dataset
37
The sum-minimization problem also arises for empirical risk minimization: in this case, Q_i(w) is the value of the loss function at the i-th example, and Q(w) = (1/n)·Σ_i Q_i(w) is the empirical risk.
When used to minimize the above function, a standard (or “batch”) gradient descent method would perform the iterations shown in this slide, where η is a step size (sometimes called the learning rate in machine learning).
In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.
However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions’ gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems. We show another approach in the next slide.
SVM: How to Estimate w?
Stochastic Gradient Descent
Instead of evaluating the gradient over all examples, evaluate it for each individual training example: ∇_j f(x_i) = w^(j) + C·∂L(x_i, y_i)/∂w^(j)
Stochastic gradient descent (SGD): for each training example x_i, update w^(j) ← w^(j) − η_t·∇_j f(x_i)
38
In stochastic (or “on-line”) gradient descent, the true gradient of the objective is approximated by the gradient at a single example: w^(j) ← w^(j) − η·∇_j f(x_i)
As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.
A compromise between computing the true gradient and the gradient at a single example, is to compute the gradient against more than one training example (called a “mini-batch”) at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.
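A minimal R sketch of SGD for the linear soft-margin SVM as described above (an illustrative sketch, not Bottou's implementation; the function name sgd_svm and the toy data are made up):
# SGD for a linear soft-margin SVM (hinge loss), illustrative sketch
sgd_svm <- function(X, y, C = 1, eta = 0.01, epochs = 20) {
  d <- ncol(X)
  w <- rep(0, d); b <- 0
  for (epoch in 1:epochs) {
    for (i in sample(nrow(X))) {          # shuffle each pass to avoid cycles
      margin <- y[i] * (sum(w * X[i, ]) + b)
      if (margin < 1) {
        # subgradient of 0.5*||w||^2 + C*hinge at example i
        w <- w - eta * (w - C * y[i] * X[i, ])
        b <- b + eta * C * y[i]
      } else {
        w <- w - eta * w                  # only the regularizer contributes
      }
    }
  }
  list(w = w, b = b)
}
# Toy usage on linearly separable data
X <- matrix(c(1, 1, 2, 2, -1, -1, -2, -2), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
model <- sgd_svm(X, y)
sign(X %*% model$w + model$b)             # should recover the labels +1, +1, -1, -1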
Example: Text Categorization
Example by Leon Bottou:
Reuters RCV1 document corpus
Predict a category of a document
One vs. the rest classification
n = 781,000 training examples (documents)
23,000 test examples
d = 50,000 features
One feature per word
Remove stop-words
Remove low frequency words
39
Example: Text Categorization
Questions:
Is SGD successful at minimizing f(w,b)?
How quickly does SGD find the min of f(w,b)?
What is the error on a test set?
SGD-SVM is successful at minimizing the value of f(w,b)
SGD-SVM is super fast
SGD-SVM test set error is comparable
40
Optimization “Accuracy”
41
SGD vs. Batch Conjugate Gradient
SGD on the full dataset vs. Batch Conjugate Gradient on a sample of n training examples
42
Practical Considerations
Need to choose the learning rate η and t_0
Leon suggests:
Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
Choose η_0:
Select a small subsample
Try various rates η_0 (e.g., 10, 1, 0.1, 0.01, …)
Pick the one that most reduces the cost
Use η_0 for the next 100k iterations on the full dataset
43
Practical Considerations
Sparse Linear SVM:
The feature vector x_i is sparse (contains many zeros)
Do not do: store x_i as a dense vector of mostly zeros
But represent x_i as a sparse vector of (index, value) pairs
Can we do the SGD update w ← w − η(w + C·∂L(x_i, y_i)/∂w) more efficiently?
Approximated in 2 steps:
1) w ← w − η·C·∂L(x_i, y_i)/∂w
2) w ← w(1 − η)
44
Cheap: step 1) — x_i is sparse, so only a few coordinates j of w will be updated
Expensive: step 2) — w is not sparse, so all coordinates need to be updated
Practical Considerations
Solution 1:
Represent the vector w as the product of a scalar s and a vector v: w = s·v
Then the update procedure is:
1) v ← v − (η·C/s)·∂L(x_i, y_i)/∂w
2) s ← s·(1 − η)
Solution 2:
Perform only step 1) for each training example
Perform step 2) with lower frequency and a higher η
45
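A small R sketch of Solution 1 (illustrative; the state representation and example values are assumptions, shown with a single dense example for clarity): w is stored as a scalar s times a vector v, so the shrinking step 2) touches only the scalar.
# Store w implicitly as w = s * v
w_state <- list(s = 1, v = rep(0, 5))
# One SGD step for an example (xi, yi) with the hinge-loss subgradient
sgd_step <- function(state, xi, yi, C = 1, eta = 0.01) {
  w <- state$s * state$v
  if (yi * sum(w * xi) < 1) {
    # step 1): v <- v - (eta * C / s) * dL/dw, where dL/dw = -yi * xi
    state$v <- state$v + (eta * C / state$s) * yi * xi
  }
  # step 2): shrink w by (1 - eta) by rescaling the scalar only
  state$s <- state$s * (1 - eta)
  state
}
xi <- c(0, 0, 1, 0, 2); yi <- 1
w_state <- sgd_step(w_state, xi, yi)
w_state$s * w_state$v   # recover the actual weight vector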
Practical Considerations
Stopping criteria:
How many iterations of SGD?
Early stopping with cross validation
Create validation set
Monitor cost function on the validation set
Stop when loss stops decreasing
46
Practical Considerations
Stopping criteria:
How many iterations of SGD?
Early Stopping
Extract two disjoint subsamples A and B of training data
Train on A, stop by validating on B
The number of epochs until stopping is an estimate of k
Train for k epochs on the full dataset
47
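An illustrative R sketch of this early-stopping procedure (the function name, the hinge-loss validation metric, and the patience of 5 epochs are assumptions, not from the lecture):
# Early-stopping sketch: run SGD on subsample A, monitor hinge loss on subsample B
early_stop_epochs <- function(XA, yA, XB, yB, C = 1, eta = 0.01, max_epochs = 100) {
  d <- ncol(XA)
  w <- rep(0, d); b <- 0
  best_loss <- Inf; best_epoch <- 0
  for (epoch in 1:max_epochs) {
    for (i in sample(nrow(XA))) {                       # one SGD pass over A
      if (yA[i] * (sum(w * XA[i, ]) + b) < 1) {
        w <- w - eta * (w - C * yA[i] * XA[i, ])
        b <- b + eta * C * yA[i]
      } else {
        w <- w - eta * w
      }
    }
    val_loss <- sum(pmax(0, 1 - yB * (XB %*% w + b)))   # hinge loss on B
    if (val_loss < best_loss) {
      best_loss <- val_loss; best_epoch <- epoch
    } else if (epoch - best_epoch >= 5) {
      break                                             # loss stopped decreasing (patience = 5, arbitrary)
    }
  }
  best_epoch                                            # estimate of k: epochs to train on the full dataset
}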
What about Multiple Classes?
Idea 1:
One against all
Learn 3 classifiers
+ vs. {o,-}
– vs. {o,+}
o vs. {+,-}
Obtain: w_+, b_+; w_−, b_−; w_o, b_o
Return the class c with the highest score: c = arg max_c (w_c·x + b_c)
48
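As an illustration of the one-against-all decision rule (the function name and the per-class weights below are made-up toy values):
# One-vs-all prediction: pick the class whose classifier gives the largest score
predict_ova <- function(models, x) {
  scores <- sapply(models, function(m) sum(m$w * x) + m$b)
  names(which.max(scores))
}
# Hypothetical per-class weight vectors obtained from 3 binary SVMs
models <- list("+" = list(w = c(1, 0), b = 0),
               "-" = list(w = c(-1, 0), b = 0),
               "o" = list(w = c(0, 1), b = -1))
predict_ova(models, x = c(2, 0.5))   # returns "+"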
What about Multiple Classes?
Idea 2:
Learn 3 sets of weights simultaneously
Want the correct class to have the highest margin: w_{y_i}·x_i + b_{y_i} ≥ 1 + w_c·x_i + b_c for all c ≠ y_i
49
Multiclass SVM
Optimization problem: min_{w,b,ξ} (1/2)·Σ_c ||w_c||² + C·Σ_i ξ_i subject to w_{y_i}·x_i + b_{y_i} ≥ w_c·x_i + b_c + 1 − ξ_i and ξ_i ≥ 0, for all i and all c ≠ y_i
To obtain the parameters w_c, b_c for each class c, we can use techniques similar to those for the 2-class SVM
SVM is widely perceived as a very powerful learning algorithm
50
Demo
LIBSVM interface for R (package e1071):
http://cran.r-project.org/web/packages/e1071/index.html
51
Demo
> # load library, class, a dependence for the SVM library
> library(class)
> # load library, SVM
> library(e1071)
> # load library, mlbench, a collection of some datasets from the UCI repository
> library(mlbench)
> # load data, has 7 classes, details of data: http://archive.ics.uci.edu/ml/datasets/Glass+Identification
> data(Glass, package = "mlbench")
> # get the index of all data
> index <- 1:nrow(Glass)
> # generate test index
> testindex <- sample(index, trunc(length(index)/3))
> # generate test set
> testset <- Glass[testindex, ]
> # generate training set
> trainset <- Glass[-testindex, ]
52
Demo
> # train svm on the training set
> # cost=100: the penalizing parameter for C-classification
> # gamma=1: the radial basis function-specific kernel parameter
> # Output values include SV, index, coefs, rho, sigma, probA, probB
> svm.model <- svm(Type~ ., data = trainset, cost = 100, gamma = 1)
> # a vector of predicted values,
> # for classification: a vector of labels
> svm.pred <- predict(svm.model, testset[, -10])
> # a cross-tabulation of the true
> # versus the predicted values
> table(pred = svm.pred, true = testset[, 10])
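> # (illustrative follow-up, not in the original demo)
> # store the confusion table and compute the overall accuracy
> tab <- table(pred = svm.pred, true = testset[, 10])
> sum(diag(tab)) / sum(tab)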
53
One-slide Takeaway
SVM:
Linearly Separable SVMs
Non-linearly Separable SVMs: Soft Margin and Kernel Trick
Parameter Estimation:
Solver: e.g. LIBSVM; not efficient for big data
Stochastic gradient descent
54
Outline
Support Vector Machines
History
Linearly Separable SVMs
Non-linearly Separable SVMs
Soft Margin
Kernel Trick
Parameter Estimation
Further Reading
55
Further Reading
Early paper about SVM algorithm: http://link.springer.com/content/pdf/10.1007%2FBF00994018.pdf
More kernel techniques:
Schölkopf, Bernhard; Burges, Christopher J. C.; and Smola, Alexander J. (editors); Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999. ISBN 0-262-19416-3.
56
Further Reading
More efficient learning algorithm for SVM:
Parallelizing Support Vector Machines on Distributed Computers: https://code.google.com/p/psvm/
57
Reference
http://www.stanford.edu/class/cs246/slides/13-svm.pdf
http://www.stanford.edu/class/cs276/handouts/lecture14-SVMs.ppt
http://i.stanford.edu/~ullman/pub/ch12.pdf
http://www.svms.org/tutorials/
http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Chang, E, Zhu, K, Wang, H, Bai, H, Li, J, Qiu, Z, and Cui, H. PSVM: Parallelizing support vector machines on distributed computers. NIPS, 20:257-264. 2007.
58
In-class Practice
Consider building an SVM over the (very small) data set shown in the figure above, and compute the SVM decision boundary.
59
(2,3)
φ: R² → R³, (x₁, x₂) ↦ (z₁, z₂, z₃) := (x₁², √2·x₁·x₂, x₂²)