
COMP3308/3608, Lecture 10a
ARTIFICIAL INTELLIGENCE
Support Vector Machines
Reference: Russell and Norvig, pp. 744-748; Witten, Frank, Hall and Pal, ch. 7.2, pp. 224-227
Irena Koprinska, irena.koprinska@sydney.edu.au

Outline
• Maximum margin hyperplane
• Linear SVM with hard margin
• Linear SVM with soft margin
• Nonlinear SVM

Support Vector Machines (SVM)
• Very popular classification method
• Rooted in statistical learning theory
• Advantages
• Can form arbitrary decision boundaries (both linear and non-linear)
• The decision boundary is a maximum margin hyperplane (i.e. it is as far as possible from the closest training examples of the 2 classes), which helps it to generalize well on new examples
• The decision boundary is defined by a subset of the training vectors, called support vectors => resistant to overfitting

Separation by a Hyperplane
• We are given the following training data (a 2-class problem: squares and circles):
• Find a linear boundary (line, hyperplane) that will separate the data
• Is the data linearly separable?
• Recall: Linearly separable = there exists a line (hyperplane) such that all squares are on one side and all circles are on the other side

Separation by a Hyperplane (2)
• One possible decision boundary (B1), and another one (B2)

Separation by a Hyperplane (3)
• Other possible decision boundaries – there are many possible separating lines


Margin of a Hyperplane and Support Vectors
• A decision boundary B1
• Support vectors are the examples (data points) that lie closest to the decision boundary; they are circled in the figure
• Margin – the separation between the boundary and the closest examples (i.e. the width by which the boundary could be increased before touching an example)
• The boundary is in the middle of the margin

Support Vectors
• The support vectors just touch the margin of the decision boundary B1
• It is possible to have more than 1 support vector for each class
• For our example: 5 support vectors, 3 for class square and 2 for class circle

Support Vectors (2)
• Why “vectors”?
• On the previous slides we saw only the tips of the support vectors; here we can see the actual vectors
• Remember that the given training examples are vectors (input vectors), where each dimension corresponds to 1 feature
• The support vectors are a subset of the input vectors => they are also vectors

Support Vectors (3)
• The support vectors define the decision boundary (they are the closest examples to the decision boundary by definition)
• If we move a support vector, the decision boundary will change
• If we move another input example, one that is not a support vector, the decision boundary will not change

Which Hyperplane is Better?
• Which hyperplane (B1 or B2) should we select? Which one is likely to classify new data more accurately?
• The hyperplane with the bigger margin, B1

Maximum Margin Hyperplane
• The hyperplane (separator) with the biggest margin is called the maximum margin hyperplane (separator)
• It is the hyperplane with the largest possible distance to the closest training examples
• SVM selects the maximum margin hyperplane
• Decision boundaries with large margins (i.e. farthest away from the training data) are better than those with small margins as they typically are more accurate on new examples
• Intuitive justification – feels safer
• We don’t know where the new examples will be, but we assume that they are drawn from the same distribution as the training examples
• If we have made a small error in the location of the boundary, the chances of causing misclassifications will be smaller if the margin is big
• => big margin is less sensitive to noise and overfitting

Maximum Margin – Formal Justification
• Formal justification from computational learning theory: structural risk minimization principle
• The generalization error (the error on new examples) depends on the training error and model complexity (also called model capacity)
• If the training error is the same, the classifier with the lower complexity will generalize better
• small margin => high complexity => higher generalization error
• big margin => low complexity => lower generalization error

Linear Decision Boundary – Revision
• Binary classification problem, N training examples (input vectors):
  (x_i, y_i), i = 1, ..., N
  x_i = (x_i1, ..., x_im)^T, y_i ∈ {−1, 1}
• x is the input vector, y is the class value
• Training data example: x_1 = (2, 2), y_1 = 1; x_2 = (0.5, 1), y_2 = −1, etc.
• A decision boundary of a linear classifier is w · x + b = 0, where w and b are the parameters
• Example boundary from the figure (axes x1, x2): x1 + 0.5x2 − 1 = 0

How do we Classify New Examples?
• If x is above the decision boundary: w · x + b > 0
• If x is below the decision boundary: w · x + b < 0
• If we know the decision boundary, we can easily classify a new example x by calculating f = w · x + b and determining its sign, i.e. class = sign(w · x + b)
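To make this decision rule concrete, here is a minimal NumPy sketch (not from the lecture) using the example boundary x1 + 0.5x2 − 1 = 0 from the previous slide; the test points are made up for illustration:

```python
import numpy as np

# Parameters of the example boundary x1 + 0.5*x2 - 1 = 0 from the previous slide
w = np.array([1.0, 0.5])
b = -1.0

def classify(x):
    # class = sign(w . x + b): +1 above the boundary, -1 below it
    f = np.dot(w, x) + b
    return 1 if f > 0 else -1

print(classify(np.array([2.0, 2.0])))   # f = 2.0  -> class 1
print(classify(np.array([0.2, 0.4])))   # f = -0.6 -> class -1
```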

Problem Statement for SVM – Definitions
• Our separating hyperplane is H
• H is in the middle of 2 other hyperplanes, H1 and H2, defined as:
  H1: w · x + b = 1
  H2: w · x + b = −1
• The points lying on H1 and H2 are the support vectors
• d is the margin of H
• It can be shown that d = 2/||w||
  (How? By calculating the distance between a point from H and the hyperplane H1, which is 1/||w||, then multiplying by 2)
• To maximize the margin d, we need to minimize ||w||
• This is equivalent to minimizing the quadratic function (1/2)||w||²
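As a quick numeric illustration of d = 2/||w|| (a sketch with an arbitrary, made-up weight vector):

```python
import numpy as np

w = np.array([1.0, 0.5])       # arbitrary illustrative weight vector
d = 2.0 / np.linalg.norm(w)    # margin d = 2 / ||w||
print(d)                       # ~1.789
```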

Learning a Maximum Margin Hyperplane in Linear SVM
• Given: a set of labelled training examples
• Learn: the maximum margin hyperplane such that all training examples are classified correctly
• This can be formulated as a constrained optimization problem:
• Given N training examples (x_i, y_i), i = 1, ..., N, where x_i = (x_i1, ..., x_im)^T is the training vector and y_i ∈ {−1, 1} is the class
• Minimize (1/2)||w||²   (i.e. maximize the margin)
• Subject to the linear constraint y_i(w · x_i + b) ≥ 1, ∀i, i = 1..N   (another way of expressing correct classification of all training examples)

More on the Linear Constraint
yi(wxi +b)1,i (1)
• It is a combination of 2 conditions:
• All training examples from class y=1 (squares) must be above the
H1 :wx+b=1 wxi+b1 ifyi=1 (2)
• All training examples from class y=-1 (circles) must be below the hyperplane H2 :wx+b = −1
wxi+b−1 ifyi=−1 (3)
(2) and (3) together can be combined in (1) (a more compact form)
hyperplane
Irena Koprinska, irena.koprinska@sydney.edu.au COMP3308/3608 AI, week 10a, 2021 18
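A small sketch checking the combined constraint y_i(w · x_i + b) ≥ 1 on a made-up toy set and candidate hyperplane (all values are illustrative, not from the slides):

```python
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [0.5, -0.5]])  # made-up inputs
y = np.array([1, 1, -1, -1])                                      # class labels
w, b = np.array([1.0, 1.0]), -2.0                                 # candidate hyperplane

margins = y * (X @ w + b)       # y_i * (w . x_i + b) for every training example
print(margins)                  # [2. 2. 2. 2.]
print(np.all(margins >= 1))     # True: all examples satisfy constraint (1)
```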


The Optimization Problem
• The optimization problem can be transformed into its dual:
  max W(α) = Σ_{i=1..N} α_i − (1/2) Σ_{i,j=1..N} α_i α_j y_i y_j (x_i · x_j)
  subject to α_i ≥ 0 and Σ_{i=1..N} α_i y_i = 0
  (x_i · x_j is the dot product of a pair of training vectors; y_i is the target (class) value of training vector x_i)
• This is a Quadratic Programming (QP) problem and can be solved (i.e. the α's can be estimated) using a standard numerical procedure
• A global maximum of the α's always exists
• The α's are called Lagrange multipliers
• w (the solution, i.e. the optimal decision boundary) is given by
  w = Σ_{i=1..N} α_i y_i x_i
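For intuition, the dual can be solved with any general-purpose constrained optimiser. Below is a hedged sketch using SciPy's SLSQP on a tiny made-up linearly separable data set; real SVM implementations use dedicated QP/SMO solvers instead:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up, linearly separable toy data (2 examples per class)
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i * y_j * (x_i . x_j)

def neg_dual(alpha):
    # Negative of W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

constraints = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0

res = minimize(neg_dual, np.zeros(len(y)), method='SLSQP',
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
support = alpha > 1e-6                       # support vectors have non-zero alpha
b = np.mean(y[support] - X[support] @ w)     # from y_i (w . x_i + b) = 1 on the support vectors
print("alpha:", np.round(alpha, 3), "w:", np.round(w, 3), "b:", round(b, 3))
```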

Characteristics of the Solution
  w = Σ_{i=1..N} α_i y_i x_i
• Many of the α's are 0
• The optimal decision boundary w is a linear combination (coefficient α_i * target value * training vector) of a small number of training examples
• The training examples x_i with non-zero α_i are the support vectors, and they are the examples closest to the decision boundary w
• => the optimal decision boundary w is a linear combination of the support vectors
• To classify a new example z, compute
  f = w · z + b = Σ_{i=1..N} α_i y_i (x_i · z) + b
  (x_i · z – dot product of the new vector and the support vectors)
• i.e. the new example belongs to class 1 if f > 0, or class -1 if f < 0: class = sign(f)

Example
• 8 2-dim. training examples (features x1, x2 and class, shown in the table); 2 classes: -1, 1
• After solving the problem with QP we find the α's; only 2 of them are non-zero and they correspond to the support vectors of the data (x_1 and x_2)
• Using the α's, the weights (defining the decision boundary) are:
  w_1 = Σ_{i=1..2} α_i y_i x_i1 = 65.5261(1*0.3858 − 1*0.4871) = −6.64
  w_2 = Σ_{i=1..2} α_i y_i x_i2 = 65.5261(1*0.4687 − 1*0.611) = −9.32
  b = 7.93   (there is a formula for b, not shown)
• Classifying new examples:
  • above the decision boundary: class 1
  • below: class -1

Soft Margin
• The method we discussed so far constructs decision boundaries that are free of misclassifications. We can allow some misclassifications.
• Ex.: B1 is better than B2 as its margin is bigger, but now 2 new examples are added; B1 misclassifies them while B2 still classifies them correctly
• This does not mean that B2 is better, as the new examples may be noise
• B1 should still be preferred as it has a wider margin and is less sensitive to overfitting and noise
• We can modify our method to allow some misclassifications, i.e. to consider the trade-off between the margin width and the number of misclassifications
• As a result, the modified method will construct a linear boundary even if the data is not linearly separable

Soft Margin Solution
• The optimisation problem formulation is similar, but there is an additional parameter C in the function we want to maximize, corresponding to the trade-off between error and margin (see the code sketch below)
• The solution is the same as in the hard margin case, but there is an upper bound C on the values of the α's

Non-linear SVM
• In practice most problems are linearly non-separable
• SVM with soft margin will find a linear boundary that optimizes the trade-off between the errors on the training data and the margin
• SVM can further be extended to find a non-linear boundary
• (Figures: soft margin SVM; non-linear decision boundary)

Non-linear SVM - Idea
• Transform the data from its original feature space to a new space where a linear boundary can be used to separate the data
• Motivation for this – Cover’s theorem:
  • A complex classification problem cast in high dimensional space nonlinearly is more likely to be linearly separable than in a low dimensional space (Cover, 1965)
• => 2 important properties of the transformation:
• 1) It is nonlinear
• 2) It is to a higher dimensional space
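Referring back to the soft-margin slides above, here is a hedged sketch of the trade-off parameter C using scikit-learn's SVC on made-up toy data with one noisy point; larger C penalises margin violations more heavily:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data: the last -1 example sits among the +1 class (noise/overlap)
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [2.2, 1.8]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # n_support_ = number of support vectors per class; score = training accuracy
    print(f"C={C}: support vectors per class={clf.n_support_}, "
          f"training accuracy={clf.score(X, y):.2f}")
```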

Example
• Non-linearly separable data x = (x1, x2) in the original space, linearly separable in the new space
• Original space (2-dim): (x1, x2)
• New space (3-dim): (x1², √2·x1x2, x2²)
• (Figure: the transformation from the old to the new space, and the decision boundaries in the old and new space)
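A minimal sketch of this particular mapping Φ(x1, x2) = (x1², √2·x1x2, x2²), computing the new 3-dim features explicitly (the input point is arbitrary):

```python
import numpy as np

def phi(x):
    # Map a 2-dim point (x1, x2) to the 3-dim space (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

print(phi(np.array([1.0, 2.0])))   # -> [1.0, 2.828..., 4.0]
```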

Issue 1
• How do we choose the dimensionality of the new space, so that the data is linearly separable in the new space?
• Theorem: If a data set is mapped to a space of sufficiently high dimension, it will always be linearly separable.
• OK, then, let’s use a space with infinite dimensionality
• Danger: Overfitting!
• Recall that a line in a d-dim space is defined by an equation with d parameters => if d ≥ N (the number of training examples), there is a serious danger of overfitting
• => there are restrictions on the dimensionality of the new space, defined by the data
• We will also deal with overfitting by constructing the maximum margin hyperplane in the new space (recall the structural risk minimization principle)

Issue 2
• Even if we know what the transformation Φ should be, solving a QP task in the new, higher dimensional space is computationally very expensive, e.g. for our example:
• 1) Transform the 2-dim training data into 3-dim using Φ, i.e. compute the new features:
  x_i → Φ(x_i), x_j → Φ(x_j)
  (x_i, x_j – a pair of training vectors in the original 2-dim feature space)
• 2) Feed the 3-dim vectors to the QP solver
• Recall that QP computes the dot product of pairs of training vectors:
  Before (in the original 2-dim space): max W(α) = Σ_{i=1..N} α_i − (1/2) Σ_{i,j=1..N} α_i α_j y_i y_j (x_i · x_j)
  Now (in the new 3-dim space): x_i · x_j is replaced by Φ(x_i) · Φ(x_j)
• However, what if the new dimensionality is not 3 but 10 or more? The computation increases dramatically! First we need to compute the new higher-dimensional features, then their dot products.
• Solution: the kernel trick

Kernel Trick
• Method for computing the dot product of a pair of vectors in the new space without first computing the transformation of each vector from the original to the new space
• We need the dot product of the features in the new space (the transformed features): Φ(x_i) · Φ(x_j)
• We will NOT compute the features in the new space and then compute their dot product, i.e. not:
  1) x_i → Φ(x_i), x_j → Φ(x_j)
  2) Φ(x_i) · Φ(x_j)
• Instead, we will compute the dot product of the original features and use it in a function (called a kernel function) to determine the value of the dot product of the transformed features:
  x_i · x_j → Φ(x_i) · Φ(x_j)

Kernel Trick for Our Example
• 2-dim original and 3-dim new space, Φ: Φ(x1, x2) = (x1², √2·x1x2, x2²)
• Consider a pair of vectors in the original space: u, v (they are 2-dim). They are transformed into the (3-dim) vectors Φ(u), Φ(v) in the new space, i.e.
  u → Φ(u), v → Φ(v)
• Let’s calculate the dot product of Φ(u) and Φ(v):
  Φ(u) · Φ(v) = (u1², √2·u1u2, u2²) · (v1², √2·v1v2, v2²)
              = u1²v1² + 2u1u2v1v2 + u2²v2²
              = (u1v1)² + (u2v2)² + 2u1v1u2v2
              = (u1v1 + u2v2)² = (u · v)²
• The dot product in the new space can be expressed via the dot product in the original space!
  Φ(u) · Φ(v) = (u · v)²   Nice property!
• This means that the dot product in the new space can be computed without first computing Φ for each input vector! Instead we will compute a function in the original space to evaluate the dot product in the new space!
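A small numeric check of this identity, with arbitrary 2-dim vectors u and v (a sketch, not part of the lecture):

```python
import numpy as np

def phi(x):
    # The explicit mapping from the example: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

u = np.array([1.0, 2.0])
v = np.array([3.0, 0.5])

lhs = np.dot(phi(u), phi(v))   # dot product computed in the new 3-dim space
rhs = np.dot(u, v) ** 2        # kernel K(u, v) = (u . v)^2 computed in the original space
print(lhs, rhs)                # both are 16.0
```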

Kernel Trick
Φ(u) · Φ(v) = (u · v)²
• The right-hand side is called a kernel function and is written as K(u, v):
  K(u, v) = (u · v)²
• So, we will compute a kernel function in the original space for each pair of training vectors to evaluate the pair’s dot product in the new space
• Thus, to find a linear boundary in the new (higher dimensional) space we will feed the QP solver not with Φ(u) · Φ(v) but with K(u, v)
• => We learn in a higher dimensional space by computing kernel functions in the lower dimensional space, instead of first transforming each vector into the higher dimensional space and then computing dot products between vectors

Mercer’s Theorem
• Recall the special property of the kernel function for our example: K(u, v) = Φ(u) · Φ(v) = (u · v)²
• Functions K for which this is true (i.e. for which such a Φ exists) need to satisfy Mercer’s Theorem; this restricts the class of functions K we can use

Frequently Used Kernel Functions
• Some kernels satisfying Mercer’s Theorem are:
  K(x, y) = (x · y + 1)^p – polynomial kernel
  K(x, y) = exp(−||x − y||² / (2σ²)) – RBF kernel
  K(x, y) = tanh(k·(x · y) − δ) – hyperbolic tangent (satisfies Mercer’s Theorem only for some k and δ)
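Hedged sketches of these kernels as plain NumPy functions; the parameter names (p, sigma, kappa, delta) are my own choices for illustration:

```python
import numpy as np

def polynomial_kernel(x, y, p=2):
    # K(x, y) = (x . y + 1)^p
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def tanh_kernel(x, y, kappa=1.0, delta=1.0):
    # K(x, y) = tanh(kappa * (x . y) - delta); Mercer's condition holds only for some kappa, delta
    return np.tanh(kappa * np.dot(x, y) - delta)
```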

Kernel Trick Again
K(u, v) = Φ(u) · Φ(v) – the kernel trick
• We specify K (which should satisfy Mercer’s Theorem), i.e. we specify Φ indirectly, instead of choosing Φ
• We perform the dot product computation in the original space, which has smaller dimensionality

The Optimization Problem Using Kernel Functions – Training
• Original, without kernel function:
  max W(α) = Σ_{i=1..N} α_i − (1/2) Σ_{i,j=1..N} α_i α_j y_i y_j (x_i · x_j)
  subject to α_i ≥ 0, Σ_{i=1..N} α_i y_i = 0
  Solution: w = Σ_{i=1..N} α_i y_i x_i   (optimal hyperplane in the original space)
• With kernel function:
  max W(α) = Σ_{i=1..N} α_i − (1/2) Σ_{i,j=1..N} α_i α_j y_i y_j K(x_i, x_j)
  subject to α_i ≥ 0, Σ_{i=1..N} α_i y_i = 0
  Solution: w = Σ_{i=1..N} α_i y_i Φ(x_i)   (optimal hyperplane in the new space)
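One way to see where the kernel enters training: a hedged sketch that feeds a precomputed Gram matrix K(x_i, x_j) to scikit-learn's SVC (kernel='precomputed'); the quadratic kernel and XOR-style toy data are illustrative assumptions, not the lecture's example:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up XOR-style data: not linearly separable in the original 2-dim space
X = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
y = np.array([1, 1, -1, -1])

def gram(A, B):
    # Quadratic polynomial kernel K(u, v) = (u . v + 1)^2 for all pairs of rows
    return (A @ B.T + 1.0) ** 2

clf = SVC(kernel='precomputed').fit(gram(X, X), y)   # train on the Gram matrix
# At prediction time, pass the kernel between the new points and the training points
print(clf.predict(gram(X, X)))
```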

The Optimization Problem Using Kernel Functions – Classification of New Examples
• Classify a new example z as sign(f), i.e. class 1 if f > 0 and class -1 if f < 0:
• Original, without kernel function:
  f = w · z + b = Σ_{i=1..N} α_i y_i (x_i · z) + b   (dot product of the new vector and the support vectors)
• With kernel function:
  f = w · z + b = Σ_{i=1..N} α_i y_i K(x_i, z) + b   (kernel function of the new vector and the support vectors)
• In both cases the summation is effectively over the support vectors, i.e. a subset of all N training examples, as many α's are 0

SVM – Summary
• Very popular and useful method for classification
• Most popular “off-the-shelf” classification method, as it has only a few parameters that are easy to tune
• Three key concepts:
• 1) Maximize the margin of the decision boundary
• 2) Transform the data into a higher dimensional space where it is more likely to be linearly separable
• 3) Kernel trick – do the calculations in the original space, not in the new higher dimensional space

SVM – Summary (2)
• Linear SVM – no kernel function; finds the optimal linear decision boundary between the positive and negative examples in the original feature space
• Non-linear SVM – uses a kernel function; finds the optimal linear boundary in the new space. This boundary (hyperplane), when mapped back to the original feature space, corresponds to a non-linear decision boundary between the positive and negative examples
• Scales well in high dimensions
• Multi-class problems need to be transformed into 2-class problems

SVM, Perceptrons and Backpropagation NNs – Comparison
• Perceptrons: simple and efficient training algorithm, but can learn only linear decision boundaries
• Backpropagation NNs: hard to train (too many parameters to set, local minima), but can learn non-linear boundaries
• SVM: relatively efficient training algorithm that can learn non-linear decision boundaries

More on Kernels
• Kernels have been designed for strings, trees and non-numeric data
• Kernelisation can be applied to all algorithms that can be re-formulated to work only with dot products (once this is done, the dot product is replaced with a kernel function)
• e.g. k-nearest neighbor and others

SVM Links
• SVMs were introduced by Cortes and Vapnik (1995) https://link.springer.com/article/10.1007/BF00994018
• Tutorial by Christopher Burges https://drive.google.com/open?id=0B6x5IP0EWft1RFZGYWpodDlGZXc
• Software – SVM Light http://svmlight.joachims.org/
• Links http://www.svms.org/ http://www.svms.org/links.html

Acknowledgements
• Some slides based on Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley
• Some slides adapted from Martin Law’s slides http://download.informatik.uni-freiburg.de/lectures/ML/2004-2005WS/Misc/Slides/15_1_SVMachines_4up.pdf