In [1]:
from __future__ import division, print_function, unicode_literals
%matplotlib inline
#%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
import numpy as np
import math
import scipy.stats as stats
Quadratic programs¶
We will be using cvxopt [https://cvxopt.org/] for the key optimisation step that implements the dual formulation of the SVM maximum-margin optimisation.
You will need to install cvxopt; I invoke "conda install cvxopt" from the command line. See [https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/] for installing packages from within Jupyter.
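If you prefer to install from inside the notebook itself, the post linked above suggests targeting the environment of the running kernel explicitly; a minimal sketch (assuming a conda-based setup, otherwise use the pip line):

import sys
# Install into the environment backing the running kernel (conda-based setup)
!conda install --yes --prefix {sys.prefix} cvxopt
# Or, with pip:
# !{sys.executable} -m pip install cvxopt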
General formulation: Find the values of the unknown variables $x_i$ for which a quadratic function of those variables is at the desired optimum (minimum) value, subject to constraints.
$$
\min_{x} \;\; \frac{1}{2} x^{T} Q x + p^{T} x
\qquad \text{subject to} \qquad
Gx \preceq h \ \text{(elementwise inequality)}, \quad Ax = b.
$$
Example for you to check and confirm that the answer (shown below) makes sense:
$$
\min_{(x_1, x_2)} \;\; 2x_1^2 + x_2^2 + x_1 x_2 + x_1 + x_2
$$
subject to:
$$
x_1 \ge 0, \qquad x_2 \ge 0, \qquad x_1 + x_2 = 1
$$
In the QP matrix format:
$$
\min_{(x_1, x_2)} \;\; \frac{1}{2}
\begin{pmatrix} x_1 & x_2 \end{pmatrix}
\overbrace{\left[\, 2 \begin{pmatrix} 2 & 1/2 \\ 1/2 & 1 \end{pmatrix} \right]}^{Q}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
+ \overbrace{\begin{pmatrix} 1 & 1 \end{pmatrix}}^{p^{T}}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
$$
subject to:
$$
\overbrace{\begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}}^{G}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\preceq
\overbrace{\begin{pmatrix} 0 \\ 0 \end{pmatrix}}^{h},
\qquad
\underbrace{\begin{pmatrix} 1 & 1 \end{pmatrix}}_{A}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\underbrace{1.0}_{b}
$$
In [2]:
from cvxopt import matrix as cvx_matrix
from cvxopt import solvers as cvx_solvers
In [3]:
Q = 2*cvx_matrix([ [2, .5], [.5, 1] ])
p = cvx_matrix(np.ones(2))
G = cvx_matrix(-np.diag(np.ones(2)))
h = cvx_matrix(np.zeros(2))
A = cvx_matrix(np.ones((1,2)))
b = cvx_matrix(1.0)
cvx_solvers.options['show_progress'] = False
sol=cvx_solvers.qp(Q, p, G, h, A, b)
In [4]:
solx = np.asarray(sol['x'])  # solver returns this
npQ = np.asarray(Q)
npP = np.asarray(p)
print("The solution gives the components of the vector x:\n", solx)
print("Optimum value of the objective function is ", 0.5*np.dot(solx.T, npQ.dot(solx)) + np.dot(npP.T, solx))
The solution gives the components of the vector x:
[[0.2500001]
[0.7499999]]
Optimum value of the objective function is [[1.875]]
In [5]:
# Check what the solver returns
print(sol['x'], sol['primal objective'])
[ 2.50e-01]
[ 7.50e-01]
1.875000000000018
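One quick way to confirm this answer makes sense: the equality constraint lets us substitute $x_2 = 1 - x_1$, reducing the problem to minimising a single-variable function over $0 \le x_1 \le 1$. A small numerical check using plain numpy (a sketch, independent of cvxopt):

x1 = np.linspace(0.0, 1.0, 100001)     # feasible x1 once x1 + x2 = 1 and x1, x2 >= 0 are imposed
x2 = 1.0 - x1
f = 2*x1**2 + x2**2 + x1*x2 + x1 + x2  # the original objective restricted to the constraint set
i = np.argmin(f)
print(x1[i], x2[i], f[i])              # expect approximately 0.25, 0.75, 1.875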
In [6]:
# Create simple data set for illustration
x_c1 = np.asarray([[0.,1.],[-1.,-1.],[-1.,2.],[-2.,0.]])
x_c2 = np.asarray([[1.,-1.],[4.,2.],[3.,0.],[2.,2.]])
y_c1=-np.ones(len(x_c1))
y_c2=np.ones(len(x_c2))
# Let's see what this toy training set looks like.
fig, ax = plt.subplots(figsize=(6,12))
ax.scatter(x_c1[:,0], x_c1[:,1], c='b')
ax.scatter(x_c2[:,0], x_c2[:,1], c='r')
ax.grid(color='k', linestyle='dotted', linewidth=0.2)
ax.set_aspect(0.5)

SVM dual optimisation in the language of quadratic programs¶
General formulation:
$$
\max_{x} \;\; \frac{1}{2} x^{T} Q x + p^{T} x
\qquad \text{subject to} \qquad
Gx \preceq h \ \text{(elementwise inequality)}, \quad Ax = b.
$$
Dual formulation of optimisation problem in SVM (refer to lecture notes/slides or the Classification chapter of FCML):
$$
\max_{\alpha \in \mathbb{R}^{N}} \;\; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m,n=1}^{N} \alpha_m \left( y_m \mathbf{x}_m^{\top} \mathbf{x}_n y_n \right) \alpha_n
\qquad \text{or, equivalently,} \qquad
\min_{\alpha \in \mathbb{R}^{N}} \;\; -\sum_{n=1}^{N} \alpha_n + \frac{1}{2} \sum_{m,n=1}^{N} \alpha_m \left( y_m \mathbf{x}_m^{\top} \mathbf{x}_n y_n \right) \alpha_n
$$
subject to:
$$
\alpha_n \ge 0, \quad n = 1, \ldots, N, \qquad \sum_{n=1}^{N} \alpha_n y_n = 0
$$
In the QP matrix format:
$$
\min_{(\alpha_1, \ldots, \alpha_N)} \;\; \frac{1}{2}
\begin{pmatrix} \alpha_1 & \alpha_2 & \ldots & \alpha_N \end{pmatrix}
\overbrace{\begin{pmatrix}
y_1 y_1 (\mathbf{x}_1 \cdot \mathbf{x}_1) & y_1 y_2 (\mathbf{x}_1 \cdot \mathbf{x}_2) & \cdots & y_1 y_N (\mathbf{x}_1 \cdot \mathbf{x}_N) \\
y_2 y_1 (\mathbf{x}_2 \cdot \mathbf{x}_1) & y_2 y_2 (\mathbf{x}_2 \cdot \mathbf{x}_2) & \cdots & y_2 y_N (\mathbf{x}_2 \cdot \mathbf{x}_N) \\
\vdots & \vdots & \ddots & \vdots \\
y_N y_1 (\mathbf{x}_N \cdot \mathbf{x}_1) & y_N y_2 (\mathbf{x}_N \cdot \mathbf{x}_2) & \cdots & y_N y_N (\mathbf{x}_N \cdot \mathbf{x}_N)
\end{pmatrix}}^{Q}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{pmatrix}
+
\overbrace{\begin{pmatrix} -1 & -1 & \ldots & -1 \end{pmatrix}}^{p^{T}}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{pmatrix}
$$
subject to:
$$
\overbrace{\begin{pmatrix}
-1 & 0 & \cdots & 0 \\
0 & -1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & -1
\end{pmatrix}}^{G}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{pmatrix}
\preceq
\overbrace{\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}}^{h},
\qquad
\underbrace{\begin{pmatrix} y_1 & y_2 & \cdots & y_N \end{pmatrix}}_{A}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{pmatrix}
=
\underbrace{\begin{pmatrix} 0 \end{pmatrix}}_{b}
$$
Setting up the matrices from training data to fit QP format¶
We will stack the data from both classes together to create the arrays X and y for the inputs and the targets, respectively (the numpy function used is vstack). We then set up the product that will go into the matrix of dot products $y_m y_n (\mathbf{x}_m \cdot \mathbf{x}_n)$. (FYI: a matrix of dot products is called a Gram matrix or a kernel matrix.) To do this, first we create an array of $y_n \mathbf{x}_n$ arrays, one for each data point; next, we compute the matrix of dot products, $Q$.
In [7]:
X=np.vstack([x_c1, x_c2])
y=np.atleast_2d(np.hstack([y_c1, y_c2])).T
print("Shapes of X (p features) and y (label) should be (N-by-p) and (N-by-1): ", (X.shape, y.shape))
yX = np.multiply(y,X)
Qmat = np.dot(yX, yX.T) # matrix of dot products (N-by-p) times (p-by-N) = (N-by-N)
Shapes of X (p features) and y (label) should be (N-by-p) and (N-by-1): ((8, 2), (8, 1))
In [8]:
# Compare Qmat computed via numpy multiply followed by dot with an explicit for-loop based computation
mat = np.zeros((len(X), len(X)))
for i in range(len(X)):
    for j in range(len(X)):
        mat[i,j] = y[i]*y[j]*np.dot(X[i],X[j])  # compare with the matrix in the quadratic program setup above
print(mat-Qmat)
[[ 0. 0. 0. 0. 0. 0. -0. 0.]
[ 0. 0. 0. 0. -0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. -0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. -0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. -0. 0. 0. 0. 0. 0.]
[-0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0.]]
In [9]:
Q = cvx_matrix(Qmat)
p = cvx_matrix(-np.ones(len(Qmat)))
G = cvx_matrix(-np.diag(np.ones(len(Qmat))))
h = cvx_matrix(np.zeros(len(Qmat)))
A = cvx_matrix(y.T)
b = cvx_matrix(0.0)
cvx_solvers.options['show_progress'] = False
sol = cvx_solvers.qp(Q, p, G, h, A, b)
alpha = np.asarray(sol['x'])
In [10]:
print(alpha, alpha.shape, "\n objective ", sol['primal objective'])
[[7.99999987e-01]
[1.23762674e-08]
[2.12336589e-10]
[6.21119410e-10]
[4.00000011e-01]
[1.27807221e-09]
[3.04534590e-10]
[3.99999988e-01]] (8, 1)
objective -0.7999999897910733
The primal problem yields the optimality conditions (from the lecture, after taking derivatives of the Lagrangian which involves the norm of the weight vector and the constraints from each data point)
$$
\sum_{n=1}^{N} \alpha_n y_n = 0 \qquad \text{and} \qquad \mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n.
$$
In [11]:
print("The shapes of alpha, y and x are ", alpha.shape, y.shape, X.shape, "\n")
print("Checking Sum of alpha_n y_n is zero: ", np.dot(y.T, alpha))
The shapes of alpha, y and x are (8, 1) (8, 1) (8, 2)
Checking Sum of alpha_n y_n is zero: [[-1.11022302e-16]]
In [12]:
wt = np.dot(alpha.T,yX)
print("Optimal hyperplane defined by weights ", wt)
Optimal hyperplane defined by weights [[ 1.20000001 -0.40000001]]
In [13]:
# Finding the support vectors: they correspond to "non-zero" values of alpha
print("alpha =\n", alpha)
SV = (alpha > 1e-6).flatten()
print(SV)
print("Non-zero Lagrange multipliers ", alpha[SV].flatten())
alpha =
[[7.99999987e-01]
[1.23762674e-08]
[2.12336589e-10]
[6.21119410e-10]
[4.00000011e-01]
[1.27807221e-09]
[3.04534590e-10]
[3.99999988e-01]]
[ True False False False True False False True]
Non-zero Lagrange multipliers [0.79999999 0.40000001 0.39999999]
All data points $\mathbf{x}_n$ satisfy the constraint
$$
y_n (\mathbf{w}^{\top} \mathbf{x}_n + b) - 1 \ge 0, \quad n = 1, \ldots, N.
$$
Since each support vector $\mathbf{x}_{+}$ and $\mathbf{x}_{-}$ lies on a hyperplane defined by $\mathbf{w}$ and a value of the intercept $b$, we obtain the value of $b$ by solving the equalities
$$
y_{+} (\mathbf{w}^{\top} \mathbf{x}_{+} + b) = 1 \qquad \text{and} \qquad y_{-} (\mathbf{w}^{\top} \mathbf{x}_{-} + b) = 1,
$$
which gives
$$
b = y_{+} - \mathbf{w}^{\top} \mathbf{x}_{+} \qquad \text{or} \qquad b = y_{-} - \mathbf{w}^{\top} \mathbf{x}_{-}.
$$
In [14]:
print("The shapes of alpha, y and x are ", alpha.shape, y.shape, X.shape)
print("The shapes of the wt and the support vectors are ", np.shape(wt), np.shape(X[SV]))
print("y[SV] = ", y[SV], "\nwt = ", wt, "\nX[SV] = ", X[SV])
print(X[SV].dot(wt.T))
b = y[SV] - X[SV].dot(wt.T)
print(b)
The shapes of alpha, y and x are (8, 1) (8, 1) (8, 2)
The shapes of the wt and the support vectors are (1, 2) (3, 2)
y[SV] = [[-1.]
[ 1.]
[ 1.]]
wt = [[ 1.20000001 -0.40000001]]
X[SV] = [[ 0. 1.]
[ 1. -1.]
[ 2. 2.]]
[[-0.40000001]
[ 1.60000001]
[ 1.6 ]]
[[-0.59999999]
[-0.60000001]
[-0.6 ]]
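Now that both $\mathbf{w}$ and $b$ are available, a quick sanity check (a short sketch reusing the arrays defined above) confirms that every training point satisfies $y_n(\mathbf{w}^{\top}\mathbf{x}_n + b) \ge 1$, with the support vectors sitting at (approximately) 1:

b0 = b[0,0]                       # the three estimates of b above agree up to solver tolerance
margins = y * (X.dot(wt.T) + b0)  # y_n (w^T x_n + b) for all training points, shape (N, 1)
print(margins.flatten())          # every entry should be >= 1; support vectors give values close to 1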
The decision boundary is $\mathbf{w}^{\top}\mathbf{x} + b = 0$. For the two-dimensional case we have $w_1 x_1 + w_2 x_2 + b = 0$. We will express $x_2$ as a function of $x_1$ with parameters $w_1$, $w_2$, $b$ and plot it below:
$$
x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}.
$$
In [15]:
(slope, intercept) = (-wt[0,0]/wt[0,1], -b[0,0]/wt[0,1])
In [16]:
X1 = np.linspace(-2,4,20)
fig, ax = plt.subplots(figsize=(4,12))
ax.scatter(x_c1[:,0], x_c1[:,1], s=80, facecolors='none', edgecolors='b')
ax.scatter(x_c2[:,0], x_c2[:,1], s=80, facecolors='none', edgecolors='r')
ax.grid(color='k', linestyle='dotted', linewidth=0.2)
ax.set_aspect(1.0)
ax.plot(X1, intercept + slope*X1 + (1/wt[0,1]), ":", alpha=0.5)
ax.plot(X1, intercept + slope*X1 - (1/wt[0,1]), ":", alpha=0.5)
ax.plot(X1, intercept + slope*X1, "-", alpha=0.5)
ax.grid(color='k', linestyle=':', linewidth=0.2)
ax.scatter(X[SV][:,0], X[SV][:,1], marker="+", color='g', s=120)
ax.set_aspect(0.2)

Your turn: In the lectures, subtracting one equation from the other gave
$$
(y_{+} - y_{-}) = 2 = \mathbf{w}^{\top} (\mathbf{x}_{+} - \mathbf{x}_{-}).
$$
Can you find the margin as returned by the optimiser?
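If you want to check your answer afterwards, one way (a sketch using the wt array computed above) is to note that the relation above implies a margin of $2/\lVert\mathbf{w}\rVert$:

# Width of the margin implied by the optimal weight vector
print("margin =", 2.0/np.linalg.norm(wt))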
Slack variables¶
In the lectures, we introduced slack variables
$\xi_n \ge 0$ to account for the fact that the data may not be linearly separable. This led to the task of minimising
$$
\frac{1}{2} \lVert \mathbf{w} \rVert^2 + \frac{C}{N} \sum_{n=1}^{N} \xi_n
$$
subject to the constraints $y_n(\mathbf{w}^{\top}\mathbf{x}_n + b) \ge 1 - \xi_n$ and $\xi_n \ge 0$ for $n = 1, \ldots, N$.
The dual problem is then the same as before, except that each $\alpha_n$ acquires an upper bound:
$$
\max_{\alpha \in \mathbb{R}^{N}} \;\; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m,n=1}^{N} \alpha_m \left( y_m \mathbf{x}_m^{\top} \mathbf{x}_n y_n \right) \alpha_n
\qquad \text{or, equivalently,} \qquad
\min_{\alpha \in \mathbb{R}^{N}} \;\; -\sum_{n=1}^{N} \alpha_n + \frac{1}{2} \sum_{m,n=1}^{N} \alpha_m \left( y_m \mathbf{x}_m^{\top} \mathbf{x}_n y_n \right) \alpha_n
$$
subject to:
$$
\frac{C}{N} \ge \alpha_n \ge 0, \quad n = 1, \ldots, N, \qquad \sum_{n=1}^{N} \alpha_n y_n = 0
$$
The threshold $b$ is obtained from the condition that all support vectors with $\alpha_n < C/N$ have slack variables $\xi_n = 0$ (i.e., shift $b$ so that the SVs with $\xi_n = 0$ lie on the $\pm 1$ hyperplanes):
$$
\mathbf{w}^{\top}\mathbf{x}_n + b = y_n.
$$
Your turn: Introduce the upper limit on the $\alpha_n$ in the constraint set, in the format that cvxopt will accept, following the lines of the example above.
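One possible starting point (a sketch, not the only way to set this up; C = 10.0 is an arbitrary hypothetical choice, and b_eq re-creates the equality right-hand side because the name b was reused above for the intercept):

C = 10.0                              # hypothetical choice of the regularisation constant
N = len(Qmat)
# Stack two sets of elementwise inequalities: -alpha_n <= 0 and +alpha_n <= C/N
G_soft = cvx_matrix(np.vstack([-np.eye(N), np.eye(N)]))
h_soft = cvx_matrix(np.hstack([np.zeros(N), (C/N)*np.ones(N)]))
b_eq = cvx_matrix(0.0)                # equality constraint right-hand side, sum_n alpha_n y_n = 0
sol_soft = cvx_solvers.qp(Q, p, G_soft, h_soft, A, b_eq)
print(np.asarray(sol_soft['x']).flatten())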
Your turn: Read up to section 1.4.1 (before multi-class classification) of [https://scikit-learn.org/stable/modules/svm.html] and corroborate the results of our example above.
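A minimal sketch of such a corroboration with scikit-learn (assuming scikit-learn is installed; a very large C approximates the hard-margin SVM solved above):

from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1e10)                 # large C ~ hard margin
clf.fit(X, y.ravel())
print("weights  ", clf.coef_)                      # compare with wt above
print("intercept", clf.intercept_)                 # compare with b above
print("support vectors\n", clf.support_vectors_)   # compare with X[SV]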
In [ ]:
In [ ]: