Support Vector Machines
MAST90083 Computational Statistics and Data Mining
Karim Seghouane
School of Mathematics & Statistics The University of Melbourne
Outline
§7.1 Separating Hyperplanes
§7.1 Support vector classifier
§7.1 Support vector Machines
In This Lecture
This lecture focuses on separating hyperplanes, then develops the theory to cover support vector machines (SVMs).
- Separating hyperplanes
- Support vector classifier
- Support vector machines
Note that the discussion will almost entirely focus on binary classification, although methods modifying SVMs for regression will be introduced.
Introduction
Examples of classification methods that estimate linear decision boundaries:
- Logistic regression
- Linear discriminant analysis
They are based on different approaches and assumptions
Separating hyperplanes
- We start by describing classifiers based on separating hyperplanes.
- These classifiers estimate linear decision boundaries that separate the data into different classes.
- They form the basis for support vector classifiers.
Hyperplanes
- In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p−1 (it does not necessarily pass through the origin).
- Example: in two dimensions, a hyperplane is a flat one-dimensional subspace, i.e. a line.
- In three dimensions, a hyperplane is a flat two-dimensional subspace, i.e. a plane.
Hyperplanes
- In $p$ dimensions, a hyperplane is defined by
  $$\beta_0 + x^\top \beta = 0,$$
  where $x = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$ and $\beta_0$, $\beta = (\beta_1, \ldots, \beta_p)^\top$ are the parameters.
- If a point $x$ in the $p$-dimensional space satisfies this equation, then $x$ "lies" or "is a point" on the hyperplane.
- The hyperplane can be used to divide the $p$-dimensional space into two halves.
Hyperplanes
- Assume $x$ is such that $\beta_0 + x^\top \beta > 0$; this signifies that $x$ lies to one side of the hyperplane.
- On the other hand, if $\beta_0 + x^\top \beta < 0$, then $x$ lies on the other side of the hyperplane.
- One can easily determine on which side of the hyperplane a point lies by calculating the sign of $\beta_0 + x^\top \beta$.

Linear decision boundaries

Some existing classification methods create a linear decision boundary, where a hyperplane creates two regions with a different classification corresponding to each. Logistic regression is one such example. For instance, if $p = 2$, then the classification is based on some cutoff,
$$I\left( \frac{1}{1 + \exp\{-(\alpha + \beta_1 X_{i1} + \beta_2 X_{i2})\}} > s \right) \quad \text{for } s \in [0,1].$$
This threshold condition inside the indicator function may be re-expressed as $-(\alpha + \beta_1 X_{i1} + \beta_2 X_{i2}) < \log\{(1-s)/s\}$, a linear condition in $(X_{i1}, X_{i2})$, so the decision boundary is a hyperplane (here, a line).

Now consider training observations $x_i \in \mathbb{R}^p$ with class labels $y_i \in \{-1, +1\}$. A separating hyperplane satisfies
$$\beta_0 + x_i^\top \beta > 0 \ \text{ if } y_i = 1, \qquad \beta_0 + x_i^\top \beta < 0 \ \text{ if } y_i = -1,$$
or equivalently
$$y_i\left(\beta_0 + x_i^\top \beta\right) > 0 \quad \text{for all } i = 1, \ldots, n.$$
Hyperplanes
- The sign of $f(x^*) = \beta_0 + x^{*\top}\beta$ allows us to classify the new or test observation to class $+1$ (if $f(x^*) > 0$) or $-1$ (if $f(x^*) < 0$).
- The magnitude of $f(x^*)$ informs us how far $x^*$ lies from the hyperplane.
- If $f(x^*) \gg 0$, then $x^*$ lies far from the hyperplane.
- If $f(x^*)$ is close to zero, then $x^*$ is located near the hyperplane, and so we are less certain about its assignment.
- A classifier based on a separating hyperplane uses a linear decision boundary.
Hyperplanes
- Consider a hyperplane $L$:
  $$\{x : f(x) = \beta_0 + x^\top \beta = 0\},$$
  where $\beta$ is the weight vector with Euclidean norm $\|\beta\|$. In $\mathbb{R}^2$ this corresponds to a line.
- For any two points $x_1$ and $x_2$ on $L$: $\beta^\top(x_1 - x_2) = 0$, and hence $\beta^* = \beta/\|\beta\|$ is a vector normal to $L$.
- For any point $x_0$ in $L$: $\beta^\top x_0 = -\beta_0$.
- The signed distance of any point $x$ to $L$ is given by
  $$\beta^{*\top}(x - x_0) = \frac{1}{\|\beta\|}\left(\beta^\top x + \beta_0\right).$$
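To make the signed-distance formula concrete, here is a minimal numerical sketch in Python/NumPy (the particular values of $\beta_0$, $\beta$ and the test point are made up for illustration):

```python
import numpy as np

# A hypothetical hyperplane in R^2: beta0 + x^T beta = 0
beta = np.array([2.0, -1.0])
beta0 = 0.5

def signed_distance(x, beta, beta0):
    """Signed distance of x to the hyperplane {x : beta0 + x^T beta = 0}."""
    return (x @ beta + beta0) / np.linalg.norm(beta)

def classify(x, beta, beta0):
    """Classify by the side of the hyperplane: sign(beta0 + x^T beta)."""
    return np.sign(x @ beta + beta0)

x_new = np.array([1.0, 3.0])
print(signed_distance(x_new, beta, beta0))  # negative: on the "-1" side
print(classify(x_new, beta, beta0))         # -1
```

The sign reproduces the classification rule $G(x) = \mathrm{sign}(f(x))$ used below, while the magnitude is the distance to the boundary.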
Constructing hyperplanes
- If the data can be perfectly separated using a hyperplane, then there will exist an infinite number of such hyperplanes.
- Any such hyperplane can be shifted a tiny bit up or down, or rotated, without coming into contact with any observations.
- There will generally be infinitely many possible choices for $\beta_0$, $\beta$.
[Figure: possible separating hyperplanes, p = 2 (axes $x_1$ and $x_2$).]
Which do you like best? In order to construct a classifier using a hyperplane, we must define a criterion for selecting which hyperplane to use.
Constructing hyperplanes
- A natural choice is the maximal margin hyperplane, also known as the optimal separating hyperplane.
- It is the separating hyperplane that is farthest from the training observations.
- We can compute the (perpendicular) distance from each training observation to a given separating hyperplane.
- The minimal distance from the observations to the hyperplane is known as the margin.
- The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
We are looking for a hyperplane that is the farthest from the training observations, i.e. has the largest margin.
[Figure: a separating hyperplane with the largest margin for the same data (axes $x_1$ and $x_2$).]
Construction of the maximal margin hyperplanes
- Assume we have available a learning set
  $$\mathcal{L} = \{(x_i, y_i) : i = 1, 2, \ldots, n\},$$
  where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$.
- The binary classification problem uses $\mathcal{L}$ to construct a function $f : \mathbb{R}^p \to \mathbb{R}$ so that $G(x) = \mathrm{sign}(f(x))$, $x \in \mathbb{R}^p$, is a classifier.
Construction of the maximal margin hyperplanes
- The separating function $f$ then classifies each new point $x$ in a test set $\tau$ into one of the two classes $\Pi_+$ or $\Pi_-$ using $G(x)$.
- $G(x)$ is $+1$ if $f(x) \ge 0$ and $-1$ if $f(x) < 0$.
- The goal is to have $f$ assign all positive points in $\tau$ (points for which $y = +1$) to $\Pi_+$ and the negative points in $\tau$ (points for which $y = -1$) to $\Pi_-$.
- In practice, we recognize that 100% correct classification may not be possible.
Construction of the maximal margin hyperplanes
- Let $M_-$ be the shortest distance from the separating hyperplane to the nearest negative data point.
- Let $M_+$ be the shortest distance from the same hyperplane to the nearest positive data point.
- The margin of the separating hyperplane is $M = M_+ + M_-$.
- When this distance is maximized, we obtain the maximal margin classifier.
Construction of the maximal margin hyperplanes
- If the training data from the two classes are linearly separable, there exist $\beta_0$ and $\beta$ such that
  $$\beta_0 + x_i^\top \beta \ge +1 \ \text{ if } y_i = +1, \qquad \beta_0 + x_i^\top \beta \le -1 \ \text{ if } y_i = -1.$$
- For data from $\mathcal{L}$ for which the first inequality holds with equality, these vectors lie on the hyperplane $H_{+1} : (\beta_0 - 1) + x^\top \beta = 0$.
- Similarly, for data from $\mathcal{L}$ for which the second inequality holds with equality, these vectors lie on the hyperplane $H_{-1} : (\beta_0 + 1) + x^\top \beta = 0$.
Construction of the maximal margin hyperplanes
- Points in $\mathcal{L}$ that lie on either one of the hyperplanes $H_{+1}$ or $H_{-1}$ are called support vectors.
- The support vectors typically consist of a small percentage of the total number of sample points.
Support vectors – the observations that the maximal margin hyperplane depends on directly.
Construction of the maximal margin hyperplanes
- If $x_{-1}$ lies on the hyperplane $H_{-1}$ and $x_{+1}$ lies on the hyperplane $H_{+1}$, then
  $$\beta_0 + x_{-1}^\top \beta = -1 \quad \text{and} \quad \beta_0 + x_{+1}^\top \beta = +1.$$
- Taking the difference gives
  $$x_{+1}^\top \beta - x_{-1}^\top \beta = 2,$$
  and their sum gives
  $$\beta_0 = -\tfrac{1}{2}\left(x_{+1}^\top \beta + x_{-1}^\top \beta\right).$$
Construction of the maximal margin hyperplanes
- The perpendicular distances of the hyperplane $\beta_0 + x^\top \beta = 0$ from the points $x_{-1}$ and $x_{+1}$ are
  $$M_- = \frac{|\beta_0 + x_{-1}^\top \beta|}{\|\beta\|} = \frac{1}{\|\beta\|} \quad \text{and} \quad M_+ = \frac{|\beta_0 + x_{+1}^\top \beta|}{\|\beta\|} = \frac{1}{\|\beta\|}.$$
- So the margin of the separating hyperplane is $d = 2/\|\beta\|$.
Construction of the maximal margin hyperplanes
- Combining the inequalities
  $$\beta_0 + x_i^\top \beta \ge +1 \ \text{ if } y_i = +1, \qquad \beta_0 + x_i^\top \beta \le -1 \ \text{ if } y_i = -1$$
  into a single set of inequalities gives
  $$y_i\left(\beta_0 + x_i^\top \beta\right) \ge +1, \quad i = 1, \ldots, n.$$
- The quantity $y_i\left(\beta_0 + x_i^\top \beta\right)$ is called the margin of $(x_i, y_i)$ with respect to the hyperplane $\beta_0 + x^\top \beta = 0$, $i = 1, \ldots, n$.
Construction of the maximal margin hyperplanes
- From
  $$\beta_0 + x_{-1}^\top \beta = -1 \quad \text{and} \quad \beta_0 + x_{+1}^\top \beta = +1,$$
  $x_i$ is a support vector with respect to the hyperplane if its margin equals one; that is,
  $$y_i\left(\beta_0 + x_i^\top \beta\right) = +1.$$
Construction of the maximal margin hyperplanes
- The problem of finding the optimal separating hyperplane corresponds to finding the hyperplane that maximizes the margin $d = 2/\|\beta\|$ subject to
  $$y_i\left(\beta_0 + x_i^\top \beta\right) \ge +1, \quad i = 1, \ldots, n.$$
- Equivalently:
  $$\text{Minimize } \ \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i\left(\beta_0 + x_i^\top \beta\right) \ge +1, \quad i = 1, \ldots, n.$$
Construction of the maximal margin hyperplanes
- This is a convex optimization problem (quadratic criterion with linear inequality constraints).
- Convexity ensures that we have a global minimum without local minima.
- The resulting optimal separating hyperplane is called the maximum margin solution; a small numerical sketch follows below.
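As an illustration, the primal problem above can be handed directly to a generic convex solver. Below is a minimal sketch using NumPy and the cvxpy package (an assumed extra dependency, not mentioned in the slides) on synthetic, linearly separable data:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
# Synthetic data in R^2, two well-separated clusters
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

beta = cp.Variable(2)
beta0 = cp.Variable()

# Hard-margin primal: minimise (1/2)||beta||^2  s.t.  y_i (beta0 + x_i^T beta) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(beta))
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
cp.Problem(objective, constraints).solve()

print("beta  =", beta.value)
print("beta0 =", beta0.value)
print("margin 2/||beta|| =", 2 / np.linalg.norm(beta.value))
```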
Construction of the maximal margin hyperplanes
- We solve this problem using Lagrange multipliers $\mu = (\mu_1, \ldots, \mu_n) \ge 0$.
- The Lagrange (primal) function
  $$L_P = \tfrac{1}{2}\|\beta\|^2 - \sum_{i=1}^n \mu_i \left\{ y_i\left(\beta_0 + x_i^\top \beta\right) - 1 \right\}$$
  is first minimized with respect to the primal variables $\beta_0$, $\beta$, and then the resulting minimum $L_D$ is maximized with respect to the dual variables $\mu$.
Construction of the maximal margin hyperplanes
􏰔 Setting the derivatives to zero, we obtain ∂L n
p =−􏰅μiyi =0 ∂β0 i=1
∂Ln n
p =β−􏰅μiyixi =0⇒β=􏰅μiyixi
∂β
i=1 i=1
Construction of the maximal margin hyperplanes
- Substituting these equations gives the dual functional of the optimization problem:
  $$L_D = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \mu_i\mu_j y_i y_j x_i^\top x_j - \sum_{i=1}^n\sum_{j=1}^n \mu_i\mu_j y_i y_j x_i^\top x_j + \sum_{i=1}^n \mu_i = \sum_{i=1}^n \mu_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \mu_i\mu_j y_i y_j \left(x_i^\top x_j\right).$$
Construction of the maximal margin hyperplanes
- Next, the Lagrange multipliers $\mu$ are obtained by maximizing the dual function subject to
  $$\mu_i \ge 0 \quad \text{and} \quad \sum_{i=1}^n \mu_i y_i = 0.$$
- The constrained maximization takes the matrix form
  $$\text{Maximize } \ L_D = \mathbf{1}_n^\top \mu - \tfrac{1}{2}\mu^\top H \mu \quad \text{subject to} \quad \mu \ge 0, \ \ \mu^\top y = 0,$$
  where $y = (y_1, \ldots, y_n)^\top$ and $H = (H_{ij})$ is an $n \times n$ matrix with $H_{ij} = y_i y_j \left(x_i^\top x_j\right)$.
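The dual is itself a small quadratic program and can be solved with the same generic tools. A hedged sketch (again assuming cvxpy, with the same kind of synthetic data as before) is given below; it uses the identity $\mu^\top H \mu = \|\sum_i \mu_i y_i x_i\|^2$ to write the quadratic term:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])   # synthetic separable data
y = np.concatenate([np.ones(20), -np.ones(20)])

n = X.shape[0]
Xy = X * y[:, None]            # rows are y_i * x_i, so H = Xy @ Xy.T

mu = cp.Variable(n, nonneg=True)
# Maximize 1'mu - 0.5 mu'H mu, using mu'H mu = ||Xy' mu||^2
objective = cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(Xy.T @ mu))
cp.Problem(objective, [y @ mu == 0]).solve()

mu_hat = mu.value
beta_hat = Xy.T @ mu_hat       # beta = sum_i mu_i y_i x_i
sv = mu_hat > 1e-6             # support vectors: mu_i > 0
beta0_hat = np.mean(y[sv] - X[sv] @ beta_hat)   # from y_i(beta0 + x_i'beta) = 1
print("support vector indices:", np.where(sv)[0])
print("beta =", beta_hat, "beta0 =", beta0_hat)
```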
Construction of the maximal margin hyperplanes
- If $\hat{\mu}$ solves this optimization, then
  $$\hat{\beta} = \sum_{i=1}^n \hat{\mu}_i y_i x_i$$
  is the optimal weight vector.
- If $\hat{\mu}_i > 0$ then $y_i\left(\beta_0^* + x_i^\top \beta^*\right) = 1$, so $x_i$ is a support vector.
- For all observations that are not support vectors, $\hat{\mu}_i = 0$.
- Therefore $\hat{\beta}$ is a linear function only of the support vectors.
- The support vectors (only) carry all the information necessary to determine the optimal hyperplane.
Construction of the maximal margin hyperplanes
- The optimal hyperplane can be written as
  $$\hat{f}(x) = \hat{\beta}_0 + x^\top \hat{\beta} = \hat{\beta}_0 + \sum_{i \in \mathrm{sv}} \hat{\mu}_i y_i \left(x^\top x_i\right).$$
- The classification rule applied to a new observation is
  $$\hat{G}(x) = \mathrm{sign}\left(\hat{f}(x)\right).$$
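In practice this classifier is rarely coded from scratch; for example, scikit-learn's SVC (an assumed dependency, not part of the slides) solves the same dual problem. The sketch below fits a linear SVC with a very large cost parameter to approximate the hard-margin solution, then rebuilds $\hat{f}(x)$ from the support vectors to confirm it matches the library's decision function:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# Very large C approximates the hard-margin solution (almost no slack allowed)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_.ravel()      # mu_i * y_i for the support vectors only
sv = clf.support_vectors_
beta0 = clf.intercept_[0]

x_new = np.array([0.5, -0.2])
f_manual = beta0 + np.sum(alpha_y * (sv @ x_new))   # beta0 + sum_i mu_i y_i <x, x_i>
print(f_manual, clf.decision_function([x_new])[0])  # the two values should agree
print("predicted class:", np.sign(f_manual))
```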

Summary of separating hyperplanes

That concludes our look at separating hyperplanes. Note that this is a fairly simple model, and there are a number of extensions that may be introduced. In particular,
- nonlinear decision boundaries, and
- allowance for some observations to be on the "wrong" side
are important if a method such as this is going to be useful for general datasets.

How would you add these extensions?
Support vector classifier
We shall address the non-separability issue first, and seek to find a good hyperplane for classification even if we cannot perfectly allocate the training observations above and below the hyperplane. We shall refer to this model as the support vector classifier.

One natural extension of separating hyperplanes is to allow some observations to be inside, or even on the wrong side of, the margin. However, we add a penalty so that this behaviour is discouraged.
Thus, instead of our original optimisation, we solve
$$\underset{\beta_0, \beta_1, \ldots, \beta_p,\ \xi_1, \ldots, \xi_n,\ M}{\text{maximize}} \ M$$
subject to
$$\sum_{j=1}^p \beta_j^2 = 1,$$
$$y_i\left(\beta_0 + x_i^\top \beta\right) \ge M(1 - \xi_i) \quad \forall i = 1, \ldots, n,$$
$$\xi_i \ge 0, \qquad \sum_{i=1}^n \xi_i \le t,$$
where $t \ge 0$ is the tuning parameter that determines the severity of misclassification allowed.

$M$ is the margin (the larger the better).

The $\{\xi_i\}$ are slack variables that allow individual observations to be misclassified.
Previous example with 5 nonzero slack variables.
[Figure: the distances ξ̃₁, …, ξ̃₅ for the five violating points are marked on the plot (axes $x_1$ and $x_2$).]

Thus, instead of our original optimisation,
$$\text{minimise } \|\beta\|^2 \quad \text{subject to} \quad Y_i\left(X_i^\top \beta + \beta_0\right) \ge 1, \ \forall i,$$
we introduce slack variables $\xi_1, \ldots, \xi_n$ and optimise the modified expression
$$\text{minimise } \|\beta\|^2 \quad \text{subject to} \quad Y_i\left(X_i^\top \beta + \beta_0\right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall i; \quad \sum_i \xi_i \le t.$$
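This budget form is again a small convex program. The following sketch (assuming cvxpy, with made-up overlapping data and an illustrative budget $t = 20$) shows how the slack variables behave:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
# Overlapping classes, so some slack is typically needed
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
y = np.concatenate([np.ones(30), -np.ones(30)])
t = 20.0   # slack budget; must be large enough for feasibility when classes overlap

beta, beta0 = cp.Variable(2), cp.Variable()
xi = cp.Variable(60, nonneg=True)

constraints = [cp.multiply(y, X @ beta + beta0) >= 1 - xi,
               cp.sum(xi) <= t]
cp.Problem(cp.Minimize(cp.sum_squares(beta)), constraints).solve()

print("number of nonzero slacks:", np.sum(xi.value > 1e-6))
print("margin width 2/||beta|| :", 2 / np.linalg.norm(beta.value))
```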
Observations with $\xi_i > 0$ are thus allowed to satisfy $Y_i\left(X_i^\top \beta + \beta_0\right) < 1$, which would have violated the original strict separating hyperplane condition.

Note that in the above diagram, the margin is $C = 1/\|\beta\|$, so the distances $\tilde{\xi}_i$ marked on the diagram equal $\xi_i/\|\beta\|$.

It turns out that this is a very similar optimisation problem to the separating hyperplane case, with some additional constraints relating to the $\xi_i$. The full details are not pursued here, but it turns out a slightly modified version of the optimisation problem is easier to work with:
$$\text{minimise } \ \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^n \xi_i \quad \text{subject to} \quad Y_i\left(X_i^\top \beta + \beta_0\right) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i. \tag{1}$$

When we compute the full KKT conditions, we see in particular that the dual expression is unchanged, and the hyperplane parameter equations are also unchanged, except that active points are now those that lie on the margin or have a nonzero slack variable. The classification rule is still the same, $G(x) = \mathrm{sign}\left(x^\top \hat{\beta} + \hat{\beta}_0\right)$.

The choice of $\gamma$ defines how much influence "bad" observations have. A large $\gamma$ will lead to a tighter margin, focusing more closely on points nearer the decision boundary. A smaller $\gamma$ will tend to give a larger margin. Cross-validation can be used to select $\gamma$.

Support vector machines or SVM

So far we have looked at linear boundaries in the feature (predictor variable) space. A further generalisation is to enlarge this space through a basis expansion, $h(X_i) = \{h_1(X_i), \ldots, h_M(X_i)\}$, to enable nonlinear boundaries. The decision boundary is then the set of points $x$ satisfying $h(x)^\top \beta + \beta_0 = 0$, and classification of a general point $x$ is based on $\mathrm{sign}\{h(x)^\top \beta + \beta_0\}$. This is precisely how we define the support vector machine (SVM). The SVM is usually characterised by a very high dimensional expansion, sometimes infinite.

The main idea is: if we cannot separate the classes by a hyperplane in the original space, we increase the dimensionality of the space, ensuring that in the "new" higher dimensional space we can separate them by a hyperplane. Recall polynomial regression as an extension of the linear model; a small illustrative sketch follows below.
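As a toy illustration of the "enlarge the space" idea (my own example, not from the slides): points inside versus outside a circle cannot be separated by a line in $(x_1, x_2)$, but they can be once the squared features $x_1^2, x_2^2$ are added. The sketch assumes scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # circular true boundary

# Linear classifier in the original 2-D space: poor fit
lin = SVC(kernel="linear", C=1.0).fit(X, y)
print("accuracy, original features :", lin.score(X, y))

# Enlarge the feature space with squared terms: the boundary is now linear in the new space
X_big = np.hstack([X, X**2])
lin_big = SVC(kernel="linear", C=1.0).fit(X_big, y)
print("accuracy, enlarged features :", lin_big.score(X_big, y))
```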
[Figure from http://www.robots.ox.ac.uk/ az/lectures/ml/lect3.pdf]

For example, in the case of "enrichment" of the feature space $x_1, \ldots, x_p$ by adding the features $x_1^2, \ldots, x_p^2$, the optimization problem has the following form:
$$\underset{\beta_0,\ \beta_{11}, \ldots, \beta_{p2},\ \xi_1, \ldots, \xi_n,\ M}{\text{maximize}} \ M$$
subject to
$$\sum_{j=1}^p \sum_{k=1}^2 \beta_{jk}^2 = 1,$$
$$y_i\left(\beta_0 + \sum_{j=1}^p \beta_{j1} x_{ij} + \sum_{j=1}^p \beta_{j2} x_{ij}^2\right) \ge M(1 - \xi_i),$$
$$\xi_i \ge 0, \qquad \sum_{i=1}^n \xi_i \le t.$$

SVM & kernels

Using kernels is an efficient computational approach to enriching the feature space. It can be shown that the support vector classifier can be written as
$$f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_i \rangle,$$
where $x$ is the new point of interest, $\{x_i\}_{i=1}^n$ are the training points and $\{\alpha_i\}_{i=1}^n$ is a set of additional parameters. Note that the inner product of two observations $x_k$ and $x_g$ is given by $\langle x_k, x_g \rangle = \sum_{j=1}^p x_{k,j}\, x_{g,j}$.

Therefore, in order to evaluate $f(x)$ at a new point $x$, we need to compute the inner products between the new point $x$ and each of the training points $x_i$. But it turns out that $\alpha_i$ is non-zero only for the support vectors. Moreover, the inner product is only one type of kernel, the linear kernel.

So we can rewrite the classifier in a more general way,
$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i),$$
where $K(x, x_i)$ is the kernel and $S$ is the set of indices of the support vectors.

SVM = support vector classifier + non-linear kernel

Note that using different kernels will result in a different set of transformation/basis functions of $x$, $\{h_m(x)\}$. This is great: instead of manually considering all the "potentially useful" transformations of our $x$'s, we can use one of a few common kernels.

The three most common choices for SVM kernels are
- Degree-$d$ polynomials: $K(x, x') = (C + \langle x, x' \rangle)^d$
- Radial basis functions: $K(x, x') = \exp\left(-\gamma \|x - x'\|^2 / C\right)$
- Sigmoid functions: $K(x, x') = 1/\{1 + \exp(-C_1 \langle x, x' \rangle + C_2)\}$
Of these the second is probably the most popular. The third is sometimes expressed using the tanh function, which is equivalent.

Example: $p = 2$. How should the decision boundary look?

[Figure: a two-class dataset with $p = 2$.]
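To see how these kernels are used in practice, here is a hedged sketch with scikit-learn's SVC on the circular toy data from the earlier sketch (scikit-learn's parameter names `degree`, `gamma` and `coef0` differ slightly from the slides' $d$, $\gamma$, $C$, $C_1$, $C_2$ notation):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)   # nonlinear (circular) boundary

for name, model in [
    ("linear  ", SVC(kernel="linear")),
    ("poly d=3", SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0)),  # (1 + <x,x'>)^3
    ("rbf     ", SVC(kernel="rbf", gamma=1.0)),        # exp(-gamma ||x - x'||^2)
    ("sigmoid ", SVC(kernel="sigmoid", coef0=0.0)),     # tanh-based sigmoid kernel
]:
    model.fit(X, y)
    print(name, "training accuracy:", round(model.score(X, y), 3))
```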
[Figures (slides of the p = 2 example): the true decision boundary; the linear decision boundary from the support vector classifier; fitted SVMs with polynomial kernels of degree 3, 4 and 5; a fitted SVM with a radial basis kernel (chosen γ = 1 and C = 1.5); SVMs with C chosen too small (0.001) and too large (1000); and an SVM with a sigmoid kernel (C = 10). Any good?]

SVM & kernels

In this case the (final!) optimisation expression is:
$$\text{minimise } \ \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^n \xi_i \quad \text{subject to} \quad Y_i\left\{h(X_i)^\top \beta + \beta_0\right\} \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i. \tag{2}$$

At first glance it would appear that choosing a very high dimensional basis expansion might lead to
- a solution that's computationally infeasible (how can you compute an infinite dimensional basis expansion vector?), and
- gross overfitting (e.g. a small region around every observation).
It turns out neither is the case. There is often a way to avoid representing the full basis expansion, and while the training error rate may be unrealistically low, there is an implicit penalisation on model complexity in the first line of (2).

Suppose we have a basis expansion $h$ that we want to use to fit the SVM boundary. The key step in the optimisation is being able to calculate expressions relating to the Lagrangian dual, similar to
$$L = \sum_i \mu_i - \frac{1}{2}\sum_i\sum_{i'} \mu_i \mu_{i'} Y_i Y_{i'} \langle h(X_i), h(X_{i'}) \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product corresponding to the usual dot product, $\langle x, z \rangle = x^\top z$. We also need to calculate inner products for making predictions, since, using the basis expanded version of $\beta$,
$$h(x)^\top \beta + \beta_0 = \sum_i \mu_i Y_i \langle h(x), h(X_i) \rangle + \beta_0.$$
Note that in general the $\mu_i$ and $\beta_0$ are easy to find, so the main computational issue is computing the inner products. It would be ideal to find a way to do this without having to write down the $h(X_i)$ explicitly. Choosing an appropriate basis expansion allows us to finesse this calculation.

The inner product $\langle h(x), h(x') \rangle$ is typically rewritten at this point as a kernel function, $K(x, x') = \langle h(x), h(x') \rangle$, so $K$ is a symmetric and positive definite function. This is usually done with the theory of reproducing kernel Hilbert spaces (RKHS) in mind, which we do not cover here.

Example - computational finesse

Suppose $p = 3$ and we define $K(x, x') = (C + \langle x, x' \rangle)^2$. Then
$$K(x, x') = (C + x_1 x_1' + x_2 x_2' + x_3 x_3')^2 = C^2 + 2C(x_1 x_1' + x_2 x_2' + x_3 x_3') + (x_1 x_1')^2 + (x_2 x_2')^2 + (x_3 x_3')^2 + 2 x_1 x_2 x_1' x_2' + 2 x_1 x_3 x_1' x_3' + 2 x_2 x_3 x_2' x_3' = \langle h(x), h(x') \rangle,$$
where
$$h(x) = \left\{C, \sqrt{2C}\,x_1, \sqrt{2C}\,x_2, \sqrt{2C}\,x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}\,x_1 x_2, \sqrt{2}\,x_1 x_3, \sqrt{2}\,x_2 x_3\right\}^\top.$$

Notice that for our particular choice of kernel $K$, we have effectively obtained a basis expansion $h$ corresponding to all monomials up to degree 2 (constant, linear, quadratic and pairwise interaction terms). Similarly, $K(x, x') = (C + \langle x, x' \rangle)^d$ gives all monomial terms up to degree $d$. In this case there would be
$$\binom{p + d}{d} \approx p^d$$
such monomials, the number of which grows exponentially fast.

If we were to try to calculate $\langle h(x), h(x') \rangle$ explicitly by calculating $h(x)$ and $h(x')$ first, then we would require on the order of $p^d$ operations. However, we could instead calculate $\langle x, x' \rangle$ and then $K(x, x') = (C + \langle x, x' \rangle)^d$, which gives the same answer, in $O(p)$ time. Thus it is possible to choose a kernel and check to see whether it has a nice underlying basis expansion interpretation.

These kernels will give nonlinear decision boundaries, and the parameter $\gamma$ in (2) has a significant role. If $\gamma$ is large, then positive $\xi_i$ are strongly discouraged and an overfit, super-wiggly boundary is produced. Small $\gamma$ allows more $\xi_i$ to be nonzero and so encourages $\|\beta\|$ to be small, resulting in a simpler model. An appropriate choice can be found by cross-validation.
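A quick numerical check of this identity (my own sketch; the value of $C$ and the test vectors are arbitrary):

```python
import numpy as np

def h(x, C):
    """Explicit basis expansion for the degree-2 polynomial kernel, p = 3."""
    x1, x2, x3 = x
    return np.array([C,
                     np.sqrt(2 * C) * x1, np.sqrt(2 * C) * x2, np.sqrt(2 * C) * x3,
                     x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2, np.sqrt(2) * x1 * x3, np.sqrt(2) * x2 * x3])

def K(x, z, C):
    """Kernel form: O(p) work instead of writing out h explicitly."""
    return (C + x @ z) ** 2

rng = np.random.default_rng(0)
x, z, C = rng.normal(size=3), rng.normal(size=3), 1.5
print(h(x, C) @ h(z, C))   # inner product in the expanded space
print(K(x, z, C))          # same value via the kernel
```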
Note that the kernel will often necessitate other parameters to be estimated as well. For instance, the $C$ for radial basis functions will affect how spiky the decision boundary can look, and so the cross-validation would usually occur over a grid of possible values for $C$ and $\gamma$.

How exactly does the SVM deal with the high dimensionality?

SVMs and penalised loss functions

Consider the optimisation problem of finding
$$\min_{\beta, \beta_0} \ \sum_{i=1}^n \left[1 - Y_i\left\{h(X_i)^\top \beta + \beta_0\right\}\right]_+ + \lambda \|\beta\|^2. \tag{3}$$
The expression $[t]_+$ means $0$ if $t$ is negative, or $t$ otherwise. This is a total loss expression plus a penalty. What is the loss function? What role does $\lambda$ play?

The optimisation problems (2) and (3) are equivalent. Thus the solution to (3) yields the SVM decision boundary.

Proof: The optimal choices of $\beta$ and $\beta_0$ are unchanged if we apply a scalar multiplication,
$$\min_{\beta, \beta_0} \ \frac{1}{2\lambda}\sum_{i=1}^n \left[1 - Y_i\left\{h(X_i)^\top \beta + \beta_0\right\}\right]_+ + \frac{1}{2}\|\beta\|^2. \tag{4}$$
For the optimal choice of $\xi_i$ in (2), we have
$$\xi_i = \begin{cases} 1 - Y_i\left[h(X_i)^\top \beta + \beta_0\right] & \text{if } Y_i\left[h(X_i)^\top \beta + \beta_0\right] < 1 \\ 0 & \text{otherwise.} \end{cases}$$
Thus (4) can be rewritten as
$$\min_{\beta, \beta_0} \ \frac{1}{2\lambda}\sum_{i=1}^n \xi_i + \frac{1}{2}\|\beta\|^2.$$
Setting $\gamma = 1/(2\lambda)$ completes the proof.

Thus we can consider an SVM as fitting a basis expansion model $f(X_i) = h(X_i)^\top \beta + \beta_0$ to predict $Y_i$, using the loss $[1 - Y_i f(X_i)]_+$ and a ridge penalty (shrinking towards zero) $\|\beta\|^2$. This particular loss, called the SVM or hinge loss, gives zero loss to all observations classified correctly beyond a certain point, and increases linearly for poor predictions.

Binary classification loss functions

We may add the following to the family of loss functions for binary classification. Here $Y_i \in \{-1, 1\}$ and $f(X_i)$ are predictions, with $\mathrm{sign}\{f(X_i)\}$ being the classifier.
- Misclassification loss: $I[Y_i \ne \mathrm{sign}\{f(X_i)\}]$
- Binomial loss: $-\log\left(\frac{1}{1 + e^{-2 Y_i f(X_i)}}\right)$
- Exponential loss: $\exp\{-Y_i f(X_i)\}$
- SVM (hinge) loss: $[1 - Y_i f(X_i)]_+$
- Squared loss: $\left(Y_i - \frac{e^{2 f(X_i)}}{1 + e^{2 f(X_i)}}\right)^2$

[Figure: the misclassification, exponential, binomial and hinge losses plotted against $Y_i f(X_i)$.]

SVMs for regression

The most analogous regression model in light of (3) is the one that minimises
$$\sum_{i=1}^n L\left\{Y_i, h(X_i)^\top \beta + \beta_0\right\} + \frac{\lambda}{2}\|\beta\|^2.$$
Have we seen a similar expression before? The kernel trick may be used to simplify calculations once again, rather than having to write out the full basis expansion. In principle any loss function could be used. One possible choice that retains the spirit of the SVM is
$$L_\varepsilon\{y, f(x)\} = \begin{cases} 0 & \text{if } |y - f(x)| < \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise.} \end{cases}$$
This means there is a margin $\varepsilon$ such that, as long as the observed values lie within it, there is no loss. Outside this, however, the loss grows linearly.
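For completeness, here is a small sketch of SVM regression with the ε-insensitive loss, using scikit-learn's SVR (an assumed tool; the data and parameter values are made up):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.2, size=100)   # noisy nonlinear signal

# epsilon is the width of the no-loss margin; C is the inverse of the penalty strength
reg = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(x[:, None], y)

print("number of support vectors:", len(reg.support_))
print("fit at x = pi/2:", reg.predict([[np.pi / 2]])[0])   # should be close to sin(pi/2) = 1
```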