Foundations of Machine Learning: Support Vector Machines and Kernels
Srinandan Dasmahapatra
December 2020
1/33
Large-Margin Classifiers
If points with distinct labels are linearly separable, there can be many decision boundaries
Choose one that leaves the largest margin between classes
Represent decision boundaries and margins by linear equations
Constrained quadratic optimisation
Not perfectly separable – accommodate a minimal number of mismatches
Slack variables
Learning with Kernels
by B Schölkopf and A Smola
2/33
Reminder: classifying 2 gaussians
Recall, the 2-dim. Gaussian: p(x1, x2 | C = a) = (√|Λa| / 2π) exp( −½ Qa(x1, x2) ), where
Qa(x1, x2) = (x − μa)⊤ Λa (x − μa) = (x1 − μa1, x2 − μa2) ( Λa11 Λa12 ; Λa21 Λa22 ) (x1 − μa1, x2 − μa2)⊤
Log of posterior distribution of class a is ga(x), gb(x) for class b
Decision boundary for Λa = Λb = Λ:
0 = ga(x) − gb(x) = ln( P(Ca)/P(Cb) ) + (μa − μb)⊤ Λ x + x-independent term
Separating hyperplane
4/33
A simple pattern classifier: decision boundary is a hyperplane
Centroids (means) of data from each class c±
w = (c+ − c−): vector joining the centroids
c = ½(c− + c+): midpoint of the centroids
For a data point xn compare the angle between (xn − c) and w to π/2:
yn = sign( w⊤(xn − c) ) ⟹ yn = sign( (c+ − c−)⊤xn + b ),
where b ≜ ½(∥c−∥² − ∥c+∥²).
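As an illustration (not from the slides), a minimal NumPy sketch of this centroid classifier; the function and variable names are my own:

```python
import numpy as np

def fit_centroid_classifier(X, y):
    """Fit w = c_plus - c_minus and b = (||c_minus||^2 - ||c_plus||^2) / 2."""
    c_plus = X[y == +1].mean(axis=0)   # centroid of the positive class
    c_minus = X[y == -1].mean(axis=0)  # centroid of the negative class
    w = c_plus - c_minus
    b = 0.5 * (c_minus @ c_minus - c_plus @ c_plus)
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)

# toy usage: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = fit_centroid_classifier(X, y)
print((predict(X, w, b) == y).mean())  # training accuracy
```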
5/33
Rosenblatt’s perceptron (1958): binary classification
yn = sign( b + w⊤xn )
6/33
Rosenblatt’s perceptron algorithm finds hyperplane
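The accompanying figure is not reproduced here; the following is a minimal sketch (my own code) of the perceptron update rule, which nudges (w, b) whenever a point is misclassified:

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Rosenblatt's perceptron: on each mistake, nudge the hyperplane towards the misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xn, yn in zip(X, y):
            if yn * (w @ xn + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yn * xn        # rotate w towards yn * xn
                b += lr * yn
                mistakes += 1
        if mistakes == 0:                # converged: all points correctly classified
            break
    return w, b
```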
7/33
Reminder: equation for hyperplane (here line)
8/33
Discriminant analysis, binary classification by logistic regression
9/33
Deep neural networks from input xn to softmax output yˆn
Input xn = (x0, x1, …, xd)n
Softmax output from the penultimate layer: Φ(xn) −wl→ { ŷ1, …, ŷc }
Softmax loss ℓ(ŷn, yn); last-layer weights wl
Adaptive feature map from input xn → Φ(xn)
This lecture: kernel map – non-adaptive features
10/33
Decision boundary in terms of dot products of input data
y(x) = sign( (c+ − c−)⊤x + b ),  b = ½(∥c−∥² − ∥c+∥²).
Substitute c+ = (1/N+) ∑_{n | yn=+1} xn and c− = (1/N−) ∑_{n | yn=−1} xn
This reduces to an expression in terms of dot products (x⊤n x) between an unseen x and the
xn from the training set D = {xn}n=1,…,N.
Introduce k : Rp × Rp → R, a linear kernel that takes 2 vectors as input and
produces a number as output; k(a, b) = a⊤b, a dot product for a, b ∈ Rp. The decision boundary can be rewritten as
y(x) = sign( (1/N+) ∑_{n: yn=+1} k(x, xn) − (1/N−) ∑_{n: yn=−1} k(x, xn) + b ),
b = ½( (1/N−²) ∑_{neg pairs} k(xm, xn) − (1/N+²) ∑_{pos pairs} k(xm, xn) )
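A minimal sketch of the same centroid rule written purely through the linear kernel k(a, b) = a⊤b, so that only dot products between data points are evaluated (names are illustrative):

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T                      # k(a, b) = a.T b for all pairs of rows

def kernel_centroid_predict(x, X, y):
    Xp, Xm = X[y == +1], X[y == -1]
    Np, Nm = len(Xp), len(Xm)
    # b = 1/2 * ( (1/Nm^2) sum over negative pairs - (1/Np^2) sum over positive pairs )
    b = 0.5 * (linear_kernel(Xm, Xm).sum() / Nm**2 - linear_kernel(Xp, Xp).sum() / Np**2)
    score = linear_kernel(Xp, x[None, :]).sum() / Np - linear_kernel(Xm, x[None, :]).sum() / Nm
    return np.sign(score + b)
```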
11/33
Hyperplane (wT x + b) classifier in dot product space
y(x) = sign( w⊤x + b ), with w, x ∈ H (a dot product space) and b ∈ R, is a decision function assuming linearly separable data x
(wT x + b) = 0 is a separating hyperplane
Many such hyperplanes, but a unique optimal hyperplane maximises separation
between classes:
maximise over w ∈ H, b ∈ R the quantity min{ ∥x − xn∥ | x ∈ H, w⊤x + b = 0, n = 1, …, N }
Different method for training compared to previous one based on centroids
To find the optimal max-margin hyperplane,
minimise over w ∈ H, b ∈ R:  ½∥w∥²  s.t.  yn(w⊤xn + b) ⩾ 1, for all n = 1, …, N.
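Assuming a convex-optimisation library such as cvxpy is available (an assumption, not part of the lecture), this primal problem can be sketched as:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2 s.t. y_n (w.x_n + b) >= 1 (assumes linearly separable data)."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, b.value
```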
12/33
Large margins for classification
13/33
Explaining the max margin, min ∥w∥ equivalence
The 3 hyperplanes share the direction ŵ = w/∥w∥. Points x+ and x− lie on
the two hyperplanes on either side of the separating hyperplane:
w⊤x+ + b = +1 and w⊤x− + b = −1
Subtracting, w⊤(x+ − x−) = 2; hence the dot product of the unit vector ŵ with the
separation vector (x+ − x−) satisfies
ŵ⊤(x+ − x−) = 2/∥w∥.
The perpendicular distance (along ŵ) between the 2 hyperplanes through the data points
closest to the separating hyperplane, and parallel to it, is 2/∥w∥. This distance is called the
margin.
To maximise this margin, we minimise the length of the weight vector w.
14/33
Why large margins?
Consider the hyperplane defined by (ŵ, b̂) := (w/∥w∥, b/∥w∥) and compute (ŵ⊤x + b̂). Multiplying by y gives a quantity that is positive for a correctly classified point x and negative otherwise.
ŵ⊤x = ∥ŵ∥∥x∥ cos(θx,ŵ) = ∥x∥ cos(θx,ŵ) is the length of x projected along the
direction perpendicular to the hyperplane; the distance to the hyperplane is obtained by adding the offset b̂. Assume training and test points are drawn from the same generating process.
Test points are “noisy versions” of training points:
(xtest, yn) := (xn + ∆x, yn), so test points in the neighbourhood of a training point should have the same target if r := ∥∆x∥ < ρ, the margin.
15/33
Optimal margin hyperplanes: to find w, b introduce Lagrange multipliers
Input data (x1, y1), …, (xN, yN), with xn ∈ H, yn ∈ {+1, −1}, and at least 1 positive and 1 negative example. We seek parameters w, b so that
y = f(x; w, b) = sign( w⊤x + b )
To find the optimal max-margin hyperplane, the primal optimisation problem is:
minimise over w ∈ H, b ∈ R:  ½∥w∥²  s.t.  yn(w⊤xn + b) ⩾ 1, for all n = 1, …, N.
This is performed by introducing the Lagrangian L(w, b, α):
min_{w,b} max_α L(w, b, α) = ½∥w∥² − ∑_{n=1}^N αn ( yn(w⊤xn + b) − 1 )
16/33
Max margin to minimax of Lagrangian: the rationale
min_{w,b} max_α L(w, b, α) = ½∥w∥² − ∑_{n=1}^N αn ( yn(w⊤xn + b) − 1 )
For each constraint yn(w⊤xn + b) − 1 ⩾ 0, introduce a Lagrange multiplier
αn ⩾ 0 and try to maximise −αn( yn(w⊤xn + b) − 1 ).
If a constraint is violated (the classifier gets the answer wrong),
( yn(w⊤xn + b) − 1 ) < 0  ⟹  −αn( yn(w⊤xn + b) − 1 ) > 0,
and the constraint term increases for increasing αn. This amplifies the error signal.
At the same time, to minimise L(w, b, α), adjust (w, b) to move the hyperplane to
prevent constraint violations (reduce error) if the data are linearly separable.
If a constraint is respected, −αn( yn(w⊤xn + b) − 1 ) < 0; increasing αn would make L
an arbitrarily large negative number. Hence αn is set to 0 for those data points
(xn, yn) away from the margin.
17/33
Minimise with respect to w and b
min_{w,b} max_α L(w, b, α) = ½∥w∥² − ∑_{n=1}^N αn ( yn(w⊤xn + b) − 1 )
Partial derivatives w.r.t. the primal variables w, b must vanish: ∇w L(w, b, α) = 0 = ∂L(w, b, α)/∂b
Therefore,
∑_{n=1}^N αn yn = 0  and  w = ∑_{n=1}^N αn yn xn.
The weights w are linear combinations of the input data (xn, yn).
If a constraint yn(w⊤xn + b) − 1 > 0 is met strictly, the corresponding αn = 0. Inputs xn for which αn > 0 are called support vectors.
18/33
Dual optimisation problem
min_{w,b} max_α L(w, b, α) = ½∥w∥² − ∑_{n=1}^N αn ( yn(w⊤xn + b) − 1 )
is called the primal problem, with primal variables w and b.
The minimax leads to
∑_{n=1}^N αn yn = 0  and  w = ∑_{n=1}^N αn yn xn.  (♠)
Substitute ♠ into L(w, b, α) to obtain the maximisation problem
max_{α∈R^N} W(α) = ∑_{n=1}^N αn − ½ ∑_{n,m=1}^N αn αm yn ym x⊤n xm
subject to αn ⩾ 0 and ∑_{n=1}^N αn yn = 0.
Maximise with respect to the Lagrange multipliers (dual variables).
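A sketch of maximising W(α) numerically, here by minimising −W(α) with SciPy's SLSQP under the stated constraints; this is only illustrative, and a dedicated QP/SMO solver would be preferred in practice:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y):
    """Maximise W(alpha) = sum(alpha) - 1/2 sum_{n,m} alpha_n alpha_m y_n y_m x_n.x_m."""
    N = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_nm = y_n y_m x_n.x_m
    fun = lambda a: -(a.sum() - 0.5 * a @ Q @ a)   # negate to minimise
    jac = lambda a: -(np.ones(N) - Q @ a)
    cons = ({'type': 'eq', 'fun': lambda a: a @ y},)
    res = minimize(fun, np.zeros(N), jac=jac, bounds=[(0, None)] * N,
                   constraints=cons, method='SLSQP')
    alpha = res.x
    w = (alpha * y) @ X                            # w = sum_n alpha_n y_n x_n
    sv = alpha > 1e-6                              # support vectors
    b = np.mean(y[sv] - X[sv] @ w)                 # from y_n (w.x_n + b) = 1 for SVs
    return alpha, w, b
```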
19/33
Constrained Optimisation
min_x f(x) subject to ci(x) ⩽ 0, ej(x) = 0
∇x f(x) = 0 alone is too restrictive: the solution x may not satisfy the constraints ci, ej. The optimisation problem min_x f(x) subject to n constraints ci(x) ⩽ 0 is expressed as
minimisation of the Lagrangian L(x, α) = f(x) + ∑_{i=1}^n αi ci(x) for αi ⩾ 0.
For learning, the quantities to optimise over are weights, hence x above are w.
20/33
KT saddle point conditions for constrained optimisation†
Kuhn-Tucker Saddle Point Condition: If there exists a pair of variables (x̄, ᾱ) with x̄ ∈ Rp and ᾱi ⩾ 0 such that
L(x̄, α) ⩽ L(x̄, ᾱ) ⩽ L(x, ᾱ)
for all αi ∈ [0, ∞), x ∈ Rp, then x̄ is a solution to the optimisation problem.
21/33
Example of Kuhn-Tucker saddle point condition†
Suppose you wish to minimise f(x) = x² for x ⩾ 2. Rewrite: x ∈ X = {x | −x + 2 ⩽ 0, x ∈ R}. Minimum:
x̄ = argmin_{x∈X} f(x) = 2.
A necessary and sufficient condition that x̄ minimises f(x): there is a number ᾱ s.t. for all x ∈ R and α ⩾ 0,
L(x̄, α) ⩽ L(x̄, ᾱ) ⩽ L(x, ᾱ).
Specialising from the general form L to f: for f(x) = x²,
f(x̄) + α(−x̄ + 2) ⩽ f(x̄) + ᾱ(−x̄ + 2) ⩽ f(x) + ᾱ(−x + 2).
Verify (x̄, ᾱ) = (2, 4): 4 ⩽ x² + ᾱ(−x + 2), or x² − ᾱx + 2(ᾱ − 2) ⩾ 0. If ᾱ = 4, the expression is a complete square: (x − 2)² ⩾ 0.
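A quick numerical check of the saddle-point inequalities for this example (illustrative only), with L(x, α) = x² + α(−x + 2) and (x̄, ᾱ) = (2, 4):

```python
import numpy as np

L = lambda x, a: x**2 + a * (-x + 2)    # Lagrangian for f(x) = x^2, constraint -x + 2 <= 0
x_bar, a_bar = 2.0, 4.0
xs = np.linspace(-5, 5, 1001)
alphas = np.linspace(0, 20, 1001)
assert np.all(L(x_bar, alphas) <= L(x_bar, a_bar) + 1e-12)  # left inequality (here an equality)
assert np.all(L(x_bar, a_bar) <= L(xs, a_bar) + 1e-12)      # right inequality: (x - 2)^2 >= 0
print("saddle point verified on the grid")
```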
22/33
Substitute KKT primal solution into L gives dual problem
min_{w,b} max_α L(w, b, α) = ½∥w∥² − ∑_{n=1}^N αn ( yn(w⊤xn + b) − 1 )
Setting partial derivatives w.r.t. the primal variables w, b to 0 gives the solution
∑_{n=1}^N αn yn = 0  and  w = ∑_{n=1}^N αn yn xn.
Substituting this solution into L leads to an optimisation problem in terms of the αn:
max_{α∈R^N} W(α) = ∑_{n=1}^N αn − ½ ∑_{n,m=1}^N αm αn ym yn x⊤m xn
subject to αn ⩾ 0 and ∑_n αn yn = 0.
23/33
Express decision boundary in terms of dot products of data
Substitute for w, b from KKT primal solution in
f(x) = sign( w⊤x + b ) = sign( ∑_{n=1}^N αn yn (x⊤n x) + b )
For b, solve yn( x⊤n w + b ) = 1 (the support vector condition).
Dot product: maps a pair of vectors w, v ∈ Rp to a real number k(w, v) = w⊤v.
In general k(w, v) = w⊤Kv = ∑_{i,j} Kij wi vj.
For some problems linear functions of x are insufficient. The feature space is enlarged to contain non-linear functions of x.
We need to work with dot products of feature vectors Φ(x1)⊤Φ(x2):
f(x) = sign( ∑_{n=1}^N αn yn Φ(xn)⊤Φ(x) + b )
Kernel trick: Φ(x1 )T Φ(x2 ) = k(x1 , x2 ) for suitable kernels k(·, ·). Do not have to compute dot products in high dimensions.
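In practice kernelised SVMs are rarely coded by hand; as an illustration (scikit-learn is my assumption, not prescribed by the lecture), an SVC with the polynomial kernel (1 + x1·x2)²:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)    # circular decision boundary

# kernel (1 + x.x')^2, matching the quadratic feature map of the next slides
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(X, y)
print("support vectors:", clf.support_.size)          # indices n with alpha_n > 0
print("training accuracy:", clf.score(X, y))
print(clf.decision_function(X[:3]))                   # sum_n y_n alpha_n k(x, x_n) + b
```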
24/33
Non-linear decision boundary is linear in higher dimension
Points x on the 2-D plane are mapped to 3-D space:
R2 ∋ x → Φ(x) ∈ R3 : (x1, x2) → Φ(x1, x2) = (x1, x2, x1² + x2²).
Linear decision boundary (plane) in 3 dimensions:
0 = b + w⊤Φ(x) = b + w1x1 + w2x2 + w3(x1² + x2²)
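A small sketch (my own construction) checking that data labelled by a circle in R² becomes linearly separable after this lift:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)    # inside vs outside the unit circle

Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2])  # lift to R^3
clf = LinearSVC(C=100.0, max_iter=10000).fit(Phi, y)  # a plane in R^3
print("accuracy in feature space:", clf.score(Phi, y))  # ~1.0: the circle is now a plane x3 = 1
```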
25/33
Linear boundaries with non-linear features: R2 to R6
Non-linear basis components of Φ(x1, x2) = (φ1, …, φ6):
Φ(x1, x2) = (1, √2 x1, √2 x2, x1², √2 x1x2, x2²).
A linear decision boundary (a plane in the 3 dimensions (φ4, φ5, φ6)) is the set of points
{(φ4, φ5, φ6) | φ4 + φ5 = 1}.
26/33
Dot products of high-dim features via kernel map
Define the feature map R2 ∋ x → Φ(x) ∈ R6 with (x1, x2) → Φ(x1, x2) = (φ1, …, φ6) and
Φ(x1, x2) = (1, √2 x1, √2 x2, x1², √2 x1x2, x2²).
If we have two data points x = (x1, x2) and y = (y1, y2) in R2, the dot product of
their images in R6 is
Φ(x) · Φ(y) = ∑_{i=1}^6 φi(x) φi(y) = (1 + (x · y))².
A dot product in data space (R2) yields the dot product in R6 without having to work in R6. This is called the kernel trick, with k(·, ·) a kernel defined over the data vectors {xn}: k(xm, xn) = Φ(xm) · Φ(xn).
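A two-line numerical check of this identity (illustrative):

```python
import numpy as np

def phi(x):
    """Quadratic feature map R^2 -> R^6."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, np.sqrt(2)*x1*x2, x2**2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(y))          # dot product in R^6
print((1 + x @ y)**2)           # kernel evaluated in R^2: identical value
```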
27/33
Data not linearly separable: slack variables for soft margins
Slack variables: distances ξn ⩾ 0 of mis-classified points from the margin.
C > 0 controls the tradeoff between maximising the margin and reducing the training error.
Relax the constraints to yn(w⊤xn + b) ⩾ 1 − ξn. Trivial: make ξn big (a lot of slack!)
Need to penalise large ξn.
Minimise ½∥w∥² + (C/N) ∑_{n=1}^N ξn subject to the constraints.
28/33
Loss minimisation of soft-margin classifier
Model ŷn = (w⊤xn + b)
The 0/1-loss I(yŷ < 0) is not convex; approximate it by convex losses:
hinge loss: max(0, 1 − yŷ)
squared hinge loss: max(0, 1 − yŷ)²
logistic (softmax): (1/ln 2) ln(1 + exp(−yŷ))
If ynŷn ⩾ 1 (correct, outside the margin), ξn = 0 (no slack needed).
Otherwise (inside the margin or incorrect), ξn = 1 − yn(w⊤xn + b).
Thus, ξn = max(0, 1 − yn(w⊤xn + b)).
Hence, minimise ½∥w∥² + (C/N) ∑_{n=1}^N max(0, 1 − yn(w⊤xn + b)).
C > 0 controls tradeoff between maximising margin and minimising training error.
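This unconstrained objective can also be attacked directly; a sketch (my own simplification, not the dual method of the next slide) using full-batch subgradient descent on the hinge loss:

```python
import numpy as np

def soft_margin_sgd(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimise 0.5*||w||^2 + (C/N) * sum_n max(0, 1 - y_n (w.x_n + b)) by subgradient descent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                        # points with non-zero hinge loss
        grad_w = w - (C / N) * (y[viol] @ X[viol])
        grad_b = -(C / N) * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```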
29/33
Dual optimisation: soft-margin classifiers and kernels
Dual optimisation problem, subject to the constraints (C/N) ⩾ αn ⩾ 0 and ∑_{n=1}^N αn yn = 0:
max_{α∈R^N} W(α) = ∑_{n=1}^N αn − ½ ∑_{n,m=1}^N αn αm yn ym x⊤n xm
The upper bound C limits the influence of any data point (regularisation).
Decision function (with kernel): f(x) = sign( ∑_{n=1}^N yn αn k(x, xn) + b )
The threshold b is obtained from the condition that all support vectors xn with αn < (C/N)
have slack variables ξn = 0 (shift b so that the SVs with ξn = 0 lie on the ±1 hyperplanes):
∑_{m=1}^N αm ym k(xm, xn) + b = yn.
30/33
Support vector regression
For regression, introduce a “tube” of radius ε around the fit: minimise
½∥w∥² + C ∑_{n=1}^N |yn − f(xn)|ε,
with the ε-insensitive loss |y − f(x)|ε = max{0, |y − f(x)| − ε}.
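A sketch of the ε-insensitive loss, together with scikit-learn's SVR for comparison (the library choice is my assumption):

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive(y, f, eps=0.1):
    """|y - f|_eps = max(0, |y - f| - eps): zero inside the tube of radius eps."""
    return np.maximum(0.0, np.abs(y - f) - eps)

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, (100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

reg = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
pred = reg.predict(X)
print("mean eps-insensitive loss:", eps_insensitive(y, pred).mean())
print("number of support vectors:", reg.support_.size)
```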
31/33
Kernels for regression, PCA, etc.
Any dot-product-based algorithm can be generalised via the kernel trick.
Correlations between features are obtained from dot products. Hence PCA, which uses correlations in the data, extends to Kernel PCA:
Covφ = (1/N) ∑_{n=1}^N Φ(xn) Φ(xn)⊤,  Covφ vi = λi vi.
It can be shown that any eigenvector v = ∑_{n=1}^N αn Φ(xn) lies in the span of {Φ(xn)}n, and the projections are
vi⊤ Φ(x) = ∑_{n=1}^N αn,i k(xn, x).
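A sketch using scikit-learn's KernelPCA (again, my assumed library) to compute the projections vi⊤Φ(x):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(3)
# two concentric rings: structure that linear PCA directions do not capture
theta = rng.uniform(0, 2*np.pi, 200)
r = np.concatenate([np.ones(100), 3*np.ones(100)]) + 0.1*rng.normal(size=200)
X = np.column_stack([r*np.cos(theta), r*np.sin(theta)])

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=1.0)
Z = kpca.fit_transform(X)       # rows are the projections v_i^T Phi(x_n) for each x_n
print(Z.shape)                  # (200, 2)
```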
32/33
Kernel based methods - moving on
Simplest SV method viewed as extension of perceptron algorithm (decision boundary found iteratively using data)
Perceptrons lead naturally into neural networks
Max margin finds optimal separating hyperplanes, lots of theorems on generalisation ability
Kernels k(x, ·) are powerful feature maps from data to high dimensions (special Hilbert spaces)
Reap benefits of high dimensions while performing computations in low-dim data space.
Large data set problems are now the domain of deep learning – multi-layer perceptrons.
The penultimate layer contains feature maps learned automatically
33/33