
Introduction to Information Retrieval
SUPPORT VECTOR MACHINE
Mainly based on
https://nlp.stanford.edu/IR-book/pdf/15svm.pdf
1

Introduction to Information Retrieval
Overview
▪ SVM is a huge topic
▪ Integration of MMDS, IIR, and Andrew Moore’s slides here
▪ Our foci:
▪ Geometric intuition ➔ Primal form
▪ Alternative interpretation from Empirical Risk Minimization point
of view.
▪ Understand the final solution (in the dual form)
▪ Generalizes well to Kernel functions
▪ SVM + Kernels
2

Introduction to Information Retrieval
Ch. 15
Linear classifiers: Which Hyperplane?
▪ Lots of possible solutions for a, b, c.
▪ Some methods find a separating
hyperplane, but not the optimal one
[according to some criterion of expected goodness]
▪ E.g., perceptron
▪ Support Vector Machine (SVM) finds an
optimal* solution.
▪ Maximizes the distance between the hyperplane and the “difficult points” close to decision boundary
▪ One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
This line represents the decision boundary:
ax + by − c = 0
3

Introduction to Information Retrieval
Sec. 15.1
Another intuition
▪ If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased
4

Introduction to Information Retrieval
Sec. 15.1
Support Vector Machine (SVM)
▪ SVMs maximize the margin around the separating hyperplane.
▪ A.k.a. large margin classifiers
▪ The decision function is fully specified by a subset of training samples, the support vectors.
▪ Solving SVMs is a quadratic programming problem
▪ Seen by many as the most successful current text classification method*
[Figure: the maximum-margin separator with its support vectors lying on the margin, contrasted with a separator that has a narrower margin]
*but other discriminative methods often perform very similarly
5

Introduction to Information Retrieval
Maximum Margin: Formalization
▪ w: decision hyperplane normal vector
▪ xi: data point i
▪ yi: class of data point i (+1 or -1)
▪ Classifier is: f(xi) = sign(wTxi + b)
NB: Not 1/0
Sec. 15.1
▪ Functional margin of xi is: yi (wTxi + b)
▪ But note that we can increase this margin simply by scaling w, b….
▪ Functional margin of dataset is twice the minimum functional margin for any point
▪ The factor of 2 comes from measuring the whole width of the margin
NB: a common trick
6
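As a side illustration (not from the original slides), a minimal numpy sketch of the scaling issue just mentioned; the weights, bias, and data point below are invented for the example.

# Functional margin and why it can be inflated by rescaling (w, b).
import numpy as np

w = np.array([2.0, 1.0])
b = -1.0
x_i = np.array([1.0, 1.0])   # a toy data point
y_i = +1                     # its class

func_margin = y_i * (w @ x_i + b)
print(func_margin)                       # 2.0

# Scaling (w, b) by 10 leaves the decision boundary unchanged
# but inflates the functional margin by the same factor:
print(y_i * ((10 * w) @ x_i + 10 * b))   # 20.0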

Introduction to Information Retrieval
Largest Margin
▪ Distance from the separating hyperplane corresponds to the “confidence” of the prediction
▪ Example: we are more sure about the class of A and B than about the class of C
[Figure: positive (+) and negative (–) points on either side of the separating hyperplane wTx + b = 0 with normal vector w; points A and B lie far from the boundary, C lies close to it; the parallel level sets wTx + b = 3.7 and wTx + b = 7.4 are also labeled]
7

Introduction to Information Retrieval
Sec. 15.1
Geometric Margin
▪ Distance from an example x to the separator is r = y (wTx + b) / ||w||
▪ Examples closest to the hyperplane are support vectors.
▪ Margin ρ of the separator is the width of separation between the support vectors of the two classes.
[Figure: a point x at distance r from the hyperplane, its projection x′ onto the hyperplane, and the margin ρ]
Algebraic derivation of r: the dotted line from x′ to x is perpendicular to the decision boundary, hence parallel to w. The unit vector in that direction is w/||w||, so the dotted line is yrw/||w|| for some r.
x′ = x − yrw/||w||.
x′ satisfies wTx′ + b = 0.
So wT(x − yrw/||w||) + b = 0. Recall that ||w|| = sqrt(wTw).
So wTx − yr||w|| + b = 0.
Solving for r (and using 1/y = y for y ∈ {+1, −1}) gives:
r = y (wTx + b) / ||w||
8
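A minimal numpy sketch of the geometric-margin formula just derived; the separator and points below are invented for illustration.

import numpy as np

# Toy separator and toy labeled points.
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 2.0], [4.0, 4.0]])
y = np.array([-1, -1, +1, +1])

# Geometric margin of each example: r_i = y_i (w^T x_i + b) / ||w||
r = y * (X @ w + b) / np.linalg.norm(w)
print(r)                            # per-example distances to the hyperplane
print(X[np.isclose(r, r.min())])    # the closest point(s): support-vector candidates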

Introduction to Information Retrieval
Help from Inner Product
▪ Remember: dot product / inner product
▪ A · B = ||A|| ||B|| cos θ (a scalar); the length of A’s projection onto B is ||A|| cos θ = (A · B) / ||B||
▪ A’s projection onto B (a vector) = ((A · B) / ||B||2) * B
9

Introduction to Information Retrieval
Derivation of the Geometric Margin
▪ Let H be a point on the line L (e.g., the foot of the perpendicular from A); then d(A, L) is the length of the projection of (A − H) onto the normal w:
  d(A, L) = wT(A − H) / ||w||
▪ Note that H is on the line L, so wTH + b = 0, i.e., wTH = −b
▪ Therefore d(A, L) = (wTA − wTH) / ||w|| = (wTA + b) / ||w||
[Figure: point A above the line L, its foot H on L, the normal vector w, and the origin (0, 0)]
10
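A quick numeric sanity check of this derivation (the w, b, and point A below are invented for the example).

import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
A = np.array([4.0, 7.0])   # a toy point not on the line

# Distance by the formula just derived:
d = (w @ A + b) / np.linalg.norm(w)

# Check: H = A - d * w/||w|| is the foot of the perpendicular, so it lies on L.
H = A - d * w / np.linalg.norm(w)
print(d)                      # 7.0
print(w @ H + b)              # ~0, so H is on the line
print(np.linalg.norm(A - H))  # 7.0, same as d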

Introduction to Information Retrieval
Sec. 15.1
Linear SVM Mathematically
The linearly separable case
▪ Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set {(xi, yi)}:
  wTxi + b ≥ 1    if yi = 1
  wTxi + b ≤ −1   if yi = −1
▪ For support vectors, the inequality becomes an equality
▪ Then, since each example’s distance from the hyperplane is r = y (wTx + b) / ||w||
▪ The margin is: ρ = 2 / ||w||
11

Introduction to Information Retrieval
Sec. 15.1
Derivation of ρ
▪ Hyperplane: wTx + b = 0
▪ Extra scale constraint: mini=1,…,n |wTxi + b| = 1
▪ The two margin hyperplanes then satisfy wTxa + b = 1 and wTxb + b = −1
▪ This implies: wT(xa − xb) = 2
▪ ρ = ||xa − xb||2 = 2/||w||2
[Figure: the margin width ρ between the hyperplanes wTxa + b = 1 and wTxb + b = −1, on either side of wTx + b = 0]
12
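A quick numeric check of ρ = 2/||w|| (toy numbers chosen only for this example, not from the slides).

import numpy as np

w = np.array([3.0, 4.0])       # ||w|| = 5
b = 0.0

# Pick points on the two margin hyperplanes along the normal direction:
x_a = w / (w @ w) * (1 - b)    # satisfies w^T x_a + b = +1
x_b = w / (w @ w) * (-1 - b)   # satisfies w^T x_b + b = -1

rho = np.linalg.norm(x_a - x_b)
print(rho, 2 / np.linalg.norm(w))   # both 0.4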

Introduction to Information Retrieval
Sec. 15.1
Solving the Optimization Problem
▪ This is now optimizing a quadratic function subject to linear constraints
▪ Quadratic optimization problems are a well-known class of mathematical programming problem, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
▪ The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:
Primal: Find w and b such that
  Φ(w) = ½ wTw is minimized,
  and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
Dual: Find α1 … αN such that
  Q(α) = Σi αi − ½ Σi Σj αi αj yi yj xiTxj is maximized, and
  (1) Σi αi yi = 0
  (2) αi ≥ 0 for all αi
13
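The dual above can be handed to any off-the-shelf QP solver. Below is a hedged sketch, assuming the cvxopt package (not mentioned in the slides) and an invented toy dataset; in practice one would use a dedicated SVM solver such as the one in scikit-learn shown later.

import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

solvers.options['show_progress'] = False

# Toy linearly separable data, invented for illustration.
X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 2.0], [4.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)

# Dual: maximize sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
# i.e. minimize 1/2 a^T P a - 1^T a with P_ij = y_i y_j x_i^T x_j,
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # encodes -alpha_i <= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))    # encodes y^T alpha = 0
b = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
w = (alpha * y) @ X             # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6               # support vectors have non-zero alpha
b0 = y[sv][0] - X[sv][0] @ w    # b = y_k - w^T x_k for any support vector
print(w, b0, alpha)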

Introduction to Information Retrieval
Sec. 15.1
Geometric Interpretation
Find w and b such that
  Φ(w) = ½ wTw is minimized,
  and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
• What are fixed and what are variables?
• Linear constraint(s): Where can the variables be?
• Quadratic objective function:
14

Introduction to Information Retrieval
Sec. 15.1
The Optimization Problem Solution
▪ The solution has the form: w = Σi αi yi xi and b = yk − wTxk for any xk such that αk ≠ 0
▪ Each non-zero αi indicates that the corresponding xi is a support vector.
▪ Then the scoring function will have the form:
Q: What are the model parameters? What does f(x) mean intuitively?
▪ Classification is based on the sign of f(x)
▪ Notice that it relies on an inner product between the test point x and the
support vectors xi
▪ We will return to this later.
▪ Also keep in mind that solving the optimization problem involved computing the inner products between all pairs of training points.
15
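To see this solution concretely, a minimal sketch with scikit-learn’s SVC (assuming scikit-learn is available; the data is an invented toy set, and a very large C approximates the hard-margin SVM).

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 2.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i y_i x_i can be recovered from the dual solution:
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)              # the two agree (linear kernel only)
print(clf.intercept_)            # b
print(clf.decision_function(X))  # f(x) = sum_i alpha_i y_i x_i^T x + b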

Introduction to Information Retrieval
Sec. 15.2.1
Soft Margin Classification
▪ If the training data is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples.
▪ Allow some errors
  ▪ Let some points be moved to where they belong, at a cost
▪ Still, try to minimize training set errors, and to place the hyperplane “far” from each class (large margin)
[Figure: two misclassified points and their slack distances ξi and ξj from the margin]
16

Introduction to Information Retrieval
Soft Margin Classification Mathematically
▪ The old formulation:
Sec. 15.2.1
[Optional]
Find w and b such that
  Φ(w) = ½ wTw is minimized,
  and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
▪ The new formulation incorporating slack variables:
Find w and b such that
  Φ(w) = ½ wTw + C Σi ξi is minimized,
  and for all {(xi, yi)}: yi (wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
▪ Parameter C can be viewed as a way to control overfitting
  ▪ A regularization term
17
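A brief, hedged illustration of the role of C using scikit-learn on an invented noisy dataset; exact numbers will vary, but a smaller C tolerates more slack and usually yields more support vectors.

import numpy as np
from sklearn.svm import SVC

# Invented noisy 2-D data: two overlapping Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Print number of support vectors per class and training accuracy.
    print(C, clf.n_support_, clf.score(X, y))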

Introduction to Information Retrieval
Alternative View
▪ SVM in the “natural” (unconstrained) form:
  argmin over w, b of   ½ w·w + C Σi=1..n max{0, 1 − yi (w·xi + b)}
  where ½ w·w is the margin/regularization term, the sum is the empirical loss L (how well we fit the training data), and C is a hyper-parameter related to regularization
▪ Equivalently, in constrained form:
  min over w, b of   ½ ||w||2 + C Σi ξi   s.t. for all i: yi (w·xi + b) ≥ 1 − ξi, ξi ≥ 0
▪ SVM uses the “Hinge Loss”: max{0, 1 − z}, where z = yi (w·xi + b)
[Figure: the hinge loss max{0, 1 − z} plotted as penalty against z, compared with the 0/1 loss]
18
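A small numpy sketch of the unconstrained hinge-loss objective above; the weights, bias, and data are toy values invented for the example.

import numpy as np

def svm_objective(w, b, X, y, C):
    # 1/2 w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * hinge.sum()

X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 2.0], [4.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
# One margin violation contributes 0.5 to the hinge term here:
print(svm_objective(np.array([0.5, 0.5]), -1.5, X, y, C=1.0))   # 0.75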

Introduction to Information Retrieval
Sec. 15.2.1
Soft Margin Classification – Solution
▪ The dual problem for soft margin classification:
[Optional]
Find α1 … αN such that
  Q(α) = Σi αi − ½ Σi Σj αi αj yi yj xiTxj is maximized, and
  (1) Σi αi yi = 0
  (2) 0 ≤ αi ≤ C for all αi
▪ Neither the slack variables ξi nor their Lagrange multipliers appear in the dual problem!
▪ Again, xi with non-zero αi will be support vectors.
▪ Solution to the dual problem is:
  w = Σi αi yi xi
  b = yk (1 − ξk) − wTxk   where k = argmaxk′ αk′
  f(x) = Σi αi yi xiTx + b
  (w is not needed explicitly for classification!)
19

Introduction to Information Retrieval
Sec. 15.1
Classification with SVMs
▪ Given a new point x, we can score its projection onto the hyperplane normal:
▪ I.e., compute the score: wTx + b = Σi αi yi xiTx + b
▪ Decide class based on whether the score is < or > 0
▪ Can set a confidence threshold t:
  Score > t: yes    Score < −t: no    Else: don’t know
20

Introduction to Information Retrieval
Sec. 15.2.1
Linear SVMs: Summary
▪ The classifier is a separating hyperplane.
▪ The most “important” training points are the support vectors; they define the hyperplane.
▪ Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.
▪ Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  Find α1 … αN such that
  Q(α) = Σi αi − ½ Σi Σj αi αj yi yj xiTxj is maximized, and
  (1) Σi αi yi = 0
  (2) 0 ≤ αi ≤ C for all αi
21

Introduction to Information Retrieval
Support Vector Regression
▪ Find a function f(x) with at most ε-deviation from the target y:
  yi − (wTxi + b) ≥ −ε
  yi − (wTxi + b) ≤ ε
▪ The optimization problem:
  Find w and b such that
  Φ(w) = ½ wTw is minimized,
  and for all {(xi, yi)}: yi − (wTxi + b) ≥ −ε and yi − (wTxi + b) ≤ ε
▪ We can introduce slack variables
▪ Similar to the soft margin loss function
22

Introduction to Information Retrieval
Sec. 15.2.3
Non-linear SVMs
▪ Datasets that are linearly separable (with some noise) work out great
▪ But what are we going to do if the dataset is just too hard?
▪ How about ... mapping data to a higher-dimensional space?
[Figure: a separable 1-D dataset on the x-axis; a harder 1-D dataset; the same data mapped to (x, x2), where it becomes linearly separable]
23

Introduction to Information Retrieval
Sec. 15.2.3
Non-linear SVMs: Feature spaces
▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
▪ c.f., polynomial regression
24

Introduction to Information Retrieval
Sec. 15.2.3
The “Kernel Trick”
▪ The linear classifier relies on an inner product between vectors K(xi, xj) = xiTxj
▪ If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)Tφ(xj)
▪ A kernel function is some function that corresponds to an inner product in some expanded feature space.
▪ Usually, no need to construct the feature space explicitly.
25

Introduction to Information Retrieval
Sec. 15.2.3
Example
▪ What about a polynomial kernel such as K(x, z) = (1 + xTz)2? It corresponds to O(d2) new cross-term features.
[Figure: decision boundaries of a linear classifier, a non-linear classifier, and a non-linear classifier with feature combinations]
26

Introduction to Information Retrieval
Why feature combinations?
▪ Examples:
  ▪ Two categorical features (age & married) encoded as one-hot encoding ➔ combination = conjunction rules
    ▪ e.g., 1[age in [30, 40) AND married = TRUE]
  ▪ [..., eagerness-for-travel, income, ...] ➔ combination indicates how much to spend on travel
    ▪ e.g., “travel rarely” AND “high income”, among other combinations
  ▪ NLP, feature vector = 1[w ∈ x] ➔ combination indicates two-word co-occurrence (where a phrase/multi-word expression (MWE) is just a special case)
▪ x ➔ φ(x), then a linear model in the new feature space is just wTφ(x) + b
  ▪ each feature combination will be assigned a weight wi
  ▪ irrelevant feature combinations will get 0 weight
27

Introduction to Information Retrieval
Why feature combinations? /2
▪ Also helpful for linear models
  ▪ Linear regression assumes no interaction between xi and xj
  ▪ One can add manual interaction terms, typically xi * xj, to still use linear regression (to learn a non-linear model!)
28

Introduction to Information Retrieval
[Optional]
▪ Inner product in an infinite-dimensional space!
▪ RBF kernel: K(x, z) = exp(−||x − z||2 / (2σ2))
29

Introduction to Information Retrieval
String Kernel
[Optional]
▪ K(s1, s2) should evaluate the similarity between the two strings
  ▪ Without this, sim(“actor”, “actress”) = 0
▪ Intuition:
  ▪ consider all substrings as (binary) “features”
  ▪ inner product in that “enhanced” feature space means the number of common substrings the two share
▪ Variants:
  ▪ (more complex): consider subsequences (with possibly a gap penalty)
  ▪ (simpler): consider all k-grams, and use Jaccard
    ▪ bigrams(actor) = {ac, ct, to, or}
    ▪ bigrams(actress) = {ac, ct, tr, re, es, ss}
    ▪ Jaccard(actor, actress) = 2/8
30

Introduction to Information Retrieval
Sec. 15.2.3
Kernels
▪ Why use kernels?
  ▪ Make a non-separable problem separable
  ▪ Map data into a better representational space
  ▪ Can be learned to capture some notion of “similarity”
▪ Common kernels
  ▪ Linear
  ▪ Polynomial: K(x, z) = (1 + xTz)d
    ▪ Gives feature combinations
  ▪ Radial basis function (infinite-dimensional space)
▪ Text classification:
  ▪ Usually use linear or polynomial kernels
31
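To make the kernel trick of the preceding slides concrete, a small numpy check: the 2-D feature map below is the standard explicit map for a degree-2 polynomial kernel, added here as an illustration rather than taken from the slides.

import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))        # inner product in the expanded space: 25.0
print((1.0 + x @ z) ** 2)     # kernel computed in the original space: 25.0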
Introduction to Information Retrieval
Sec. 15.1
Classification with SVMs with Kernels
▪ Given a new point x, we can compute its score, i.e., f(x) = Σi αi yi K(xi, x) + b
▪ Decide class based on whether the score is < or > 0
▪ E.g., in scikit-learn
32
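The “E.g., in scikit-learn” remark above presumably refers to something like the following sketch; the data, kernel choice, and parameters below are invented for illustration.

import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable:
# the class depends on the distance from the origin.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)

scores = clf.decision_function(X[:5])   # f(x) = sum_i alpha_i y_i K(x_i, x) + b
print(scores)
print(clf.predict(X[:5]))               # class = sign of the score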

Introduction to Information Retrieval
Sec. 15.1
Pros and Cons of the SVM Classifier
▪ Pros
▪ High accuracy
▪ Fast classification speed
▪ Works with cases where #features > #samples
▪ Can adapt to different objects (due to Kernel)
▪ Any K(u, v) can be used if symmetric, continuous, and positive
semi-definite.
▪ Or explicitly engineer the feature space.
▪ Cons
▪ Training does not scale well with the number of training samples (O(d*n^2) or O(d*n^3))
▪ Hyper-parameters need tuning
33

Introduction to Information Retrieval
Ch. 15
Resources for today’s lecture
▪ Christopher J. C. Burges. 1998. A Tutorial on Support Vector Machines for Pattern Recognition
▪ S. T. Dumais. 1998. Using SVMs for text categorization, IEEE Intelligent Systems, 13(4)
▪ S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM ’98, pp. 148-155.
▪ Yiming Yang, Xin Liu. 1999. A re-examination of text categorization methods. 22nd Annual International SIGIR
▪ Tong Zhang, Frank J. Oles. 2001. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31
▪ Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
▪ T. Joachims, Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
▪ Fan Li, Yiming Yang. 2003. A Loss Function Analysis for Classification Methods in Text
Categorization. ICML 2003: 472-479.
▪ Tie-Yan Liu, Yiming Yang, Hao Wan, et al. 2005. Support Vector Machines Classification with Very Large Scale Taxonomy, SIGKDD Explorations, 7(1): 36-43.
▪ ‘Classic’ Reuters-21578 data set: http://www.daviddlewis.com/resources/testcollections/reuters21578/