Statistical Machine Learning
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XIV: Sparse Kernel Machines
Sparse Kernel Machines – Motivation
Nonlinear kernels extended our toolbox of methods considerably.
Wherever an inner product is used in an algorithm, we can replace it with a kernel.
Kernels act as a kind of 'similarity' measure and can be defined over graphs, sets, strings, and documents.
But the kernel matrix is a square matrix whose dimension equals the number of data points N.
In order to calculate it, the kernel function must be evaluated for all pairs of training inputs (see the cost sketch below).
Sparse Kernel Machines implement learning algorithms where, for prediction, the kernel is only evaluated at a subset of the training data.
Today we introduce the famous Support Vector Machine: a non-probabilistic, non-parametric classifier, related to Kernel Logistic Regression.
It partially alleviates the time complexity problem, but in general approximations are needed for a scalable algorithm.
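To make the quadratic cost concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture) that builds the N × N Gram matrix for a Gaussian kernel; the quadratic number of kernel evaluations is exactly what sparse kernel machines avoid at prediction time.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.45):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, kernel=rbf_kernel):
    # K is N x N: one kernel evaluation per pair of training inputs.
    N = X.shape[0]
    K = np.zeros((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

X = np.random.randn(5, 2)   # N = 5 toy inputs in 2-D
K = gram_matrix(X)
print(K.shape)              # (5, 5), grows quadratically with N
```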
Maximum Margin Classifiers
Return to the two-class classification with a linear model
y(x) = wᵀφ(x) + b
where φ(x) is some fixed feature space mapping, and b is the bias.
Training data are N input vectors x1, . . . , xN with corresponding targets t1, . . . , tN, where tn ∈ {−1, +1}.
The class of a new point is predicted as sign(y(x)).
Assume the training data are linearly separable in feature space, i.e. there exist w and b such that
tn y(xn) > 0,   n = 1, . . . , N.
The Multiple Separator problem
There may exist many solutions w and b for which the
linear classifier perfectly separates the two classes!
Maximum Margin Classifiers
There may exist many solutions w and b for which the
linear classifier perfectly separates the two classes.
The perceptron algorithm can find a solution, but this
depends on the initial choice of parameters.
What is the decision boundary which results in the best
generalisation (smallest generalisation error)?
Maximum Margin Classifiers
The margin is the smallest distance between the decision
boundary and any of the samples. (We will see later why
y = ±1 on the margin.)
[Figure: decision boundary y = 0 with the contours y = ±1; the margin is the distance between the boundary and the closest samples.]
Maximum Margin Classifiers
Support Vector Machines (SVMs) choose the decision
boundary which maximises the margin.
[Figure: separating boundaries with contours y = 0 and y = ±1, illustrating the maximum margin solution.]
Reminder – Linear Classification
For a linear model with weights w and bias b, the decision boundary is {x : wᵀx + b = 0}.
Any x has a projection xproj onto the boundary, and
x − xproj = λ · w   for some scalar λ,
since w is orthogonal to the decision boundary.
But we also have (using wᵀxproj = −b, as xproj lies on the boundary)
wᵀx − wᵀxproj = wᵀx + b = λ · wᵀw.
[Figure: a point x and its projection xproj onto the hyperplane wᵀz + b = 0.]
Reminder – Linear Classification
So,
‖x − xproj‖ = |λ| · ‖w‖ = |wᵀx + b| / ‖w‖.
The distance of x from the decision boundary is therefore
|wᵀx + b| / ‖w‖ = |y(x)| / ‖w‖.
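To connect the algebra to numbers, here is a tiny NumPy check (my own sketch with made-up w, b, and x, not from the slides) that |y(x)|/‖w‖ equals the geometric distance to the hyperplane.

```python
import numpy as np

w = np.array([3.0, 4.0])    # hypothetical weights
b = -2.0                    # hypothetical bias
x = np.array([1.0, 1.0])

y_x = w @ x + b                          # y(x) = w^T x + b
dist = abs(y_x) / np.linalg.norm(w)      # |y(x)| / ||w||

# Cross-check: project x onto the hyperplane and measure the gap.
x_proj = x - (y_x / (w @ w)) * w
print(dist, np.linalg.norm(x - x_proj))  # both equal 1.0 here
```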
Maximum Margin Classifiers
Support Vector Machines choose the decision boundary
which maximises the smallest distance to samples in both
classes.
We calculated the distance of a point x from the
hyperplane y(x) = 0 as |y(x)|/‖w‖.
Perfect classification means tny(xn) > 0 for all n.
Thus, the distance of xn from the decision boundary is
tn y(xn) / ‖w‖ = tn (wᵀφ(xn) + b) / ‖w‖.
Maximum Margin Classifiers
Support Vector Machines choose the decision boundary
which maximises the smallest distance to samples in both
classes.
For the maximum margin solution, solve
arg max_{w,b} { (1/‖w‖) min_n [ tn (wᵀφ(xn) + b) ] }.
How can we solve this?
Maximum Margin Classifiers
Maximum margin solution
arg max_{w,b} { (1/‖w‖) min_n [ tn (wᵀφ(xn) + b) ] }
or, equivalently,
arg max_{w,b,γ} { γ / ‖w‖ }   s.t.   tn (wᵀφ(xn) + b) ≥ γ,   n = 1, . . . , N.
By rescaling any (w, b) by 1/γ, we don't change the answer!
We can therefore assume that γ = 1.
The canonical representation of the decision hyperplane is
tn (wᵀφ(xn) + b) ≥ 1,   n = 1, . . . , N.
Maximum Margin Classifiers
Maximum margin solution
arg max_{w,b} { (1/‖w‖) min_n [ tn (wᵀφ(xn) + b) ] }
Transformed once
arg max_{w,b} { 1/‖w‖ }   s.t.   tn (wᵀφ(xn) + b) ≥ 1.
Transformed again
arg min_{w,b} { (1/2) ‖w‖² }   s.t.   tn (wᵀφ(xn) + b) ≥ 1.
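This last form is a quadratic program. As a sanity check (my own sketch, not the lecture's method), it can be handed to a generic constrained optimiser; scipy.optimize.minimize with SLSQP, a tiny separable 2-D data set, and φ(x) = x are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data; phi(x) = x.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-2.0, -1.5], [-1.0, -2.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):                 # (1/2) ||w||^2; the bias b = wb[-1] is unpenalised
    w = wb[:-1]
    return 0.5 * w @ w

constraints = [{'type': 'ineq',    # t_n (w^T x_n + b) - 1 >= 0
                'fun': lambda wb, n=n: t[n] * (X[n] @ wb[:-1] + wb[-1]) - 1.0}
               for n in range(len(t))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method='SLSQP')
w, b = res.x[:-1], res.x[-1]
print('w =', w, 'b =', b)
print('margins:', t * (X @ w + b))  # all >= 1 at the optimum
```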
Lagrange Multipliers
Lagrange Multipliers
Find the stationary points for a function f (x1, x2) subject to
one or more constraints on the variables x1 and x2 written
in the form g(x1, x2) = 0.
Direct approach
1 Solve g(x1, x2) = 0 for one of the variables to get x2 = h(x1).
2 Insert the result into f (x1, x2) to get a function of one variable
f (x1, h(x1)).
3 Find the stationary point(s) x1⋆ of f(x1, h(x1)), with corresponding value x2⋆ = h(x1⋆).
Finding x2 = h(x1) may be hard.
Symmetry in the variables x1 and x2 is lost.
Lagrange Multipliers
To solve
max_x f(x)   s.t.   g(x) = 0,
introduce the Lagrangian function
L(x, λ) = f(x) + λ g(x)
from which we get the stationarity conditions
∇x L(x, λ) = ∇f(x) + λ∇g(x) = 0
and the constraint itself
∂L(x, λ)/∂λ = g(x) = 0.
These are D equations resulting from ∇x L(x, λ) and one equation from ∂L(x, λ)/∂λ, which together determine x⋆ and λ.
Lagrange Multipliers – Example
Given f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 − 1 = 0.
Define the Lagrangian function
L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1).
A stationary solution with respect to x1, x2, and λ must satisfy
−2x1 + λ = 0
−2x2 + λ = 0
x1 + x2 − 1 = 0.
Therefore (x1⋆, x2⋆) = (1/2, 1/2) and λ = 1.
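A quick symbolic check of this example (my own sketch, assuming SymPy is available):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g

# Stationarity in x1 and x2, plus the constraint itself.
eqs = [sp.diff(L, x1), sp.diff(L, x2), g]
print(sp.solve(eqs, [x1, x2, lam]))   # {x1: 1/2, x2: 1/2, lam: 1}
```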
Lagrange Multipliers – Example
Given f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 − 1 = 0.
Lagrangian L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1)
(x1⋆, x2⋆) = (1/2, 1/2).
[Figure: the constraint curve g(x1, x2) = 0 in the (x1, x2) plane, with the constrained stationary point (x1⋆, x2⋆).]
Lagrange Multipliers for Inequality Constraints
To solve
max_x f(x)   s.t.   g(x) ≥ 0,
define the Lagrangian
L(x, λ) = f(x) + λ g(x).
Solve for x and λ subject to the constraints (Karush-Kuhn-Tucker or KKT conditions)
g(x) ≥ 0
λ ≥ 0
λ g(x) = 0.
Lagrange Multipliers for General Case
Maximise f(x) subject to the constraints gj(x) = 0 for j = 1, . . . , J, and hk(x) ≥ 0 for k = 1, . . . , K.
Define the Lagrange multipliers {λj} and {µk}, and the Lagrangian
L(x, {λj}, {µk}) = f(x) + ∑_{j=1}^J λj gj(x) + ∑_{k=1}^K µk hk(x).
Solve for x, {λj}, and {µk} subject to the constraints (Karush-Kuhn-Tucker or KKT conditions)
hk(x) ≥ 0
µk ≥ 0
µk hk(x) = 0
for k = 1, . . . , K.
For minimisation of f(x), change the sign in front of the Lagrange multipliers.
SVMs and Maximum Margin Classifiers: Redux
Maximum Margin Classifiers
Quadratic Programming (QP) problem: minimise a quadratic function subject to linear constraints.
arg min_{w,b} { (1/2) ‖w‖² }   s.t.   tn (wᵀφ(xn) + b) ≥ 1.
Introduce Lagrange multipliers {an ≥ 0} to get the Lagrangian
L(w, b, a) = (1/2) ‖w‖² − ∑_{n=1}^N an { tn (wᵀφ(xn) + b) − 1 }
(note the negative sign in front of the multipliers, since this is a minimisation problem).
Maximum Margin Classifiers
Lagrangian
L(w, b, a) = (1/2) ‖w‖² − ∑_{n=1}^N an { tn (wᵀφ(xn) + b) − 1 }.
Setting the derivatives with respect to w and b to zero gives
w = ∑_{n=1}^N an tn φ(xn)        0 = ∑_{n=1}^N an tn.
Dual representation (again a QP problem): maximise
L̃(a) = ∑_{n=1}^N an − (1/2) ∑_{n=1}^N ∑_{m=1}^N an am tn tm k(xn, xm)
subject to
∑_{n=1}^N an tn = 0
an ≥ 0,   n = 1, . . . , N.
Maximum Margin Classifiers
Dual representation (again a QP problem): maximise
L̃(a) = ∑_{n=1}^N an − (1/2) ∑_{n=1}^N ∑_{m=1}^N an am tn tm k(xn, xm)
subject to
∑_{n=1}^N an tn = 0
an ≥ 0,   n = 1, . . . , N.
Vector form: maximise
L̃(a) = 1ᵀa − (1/2) aᵀQa
subject to
aᵀt = 0
an ≥ 0,   n = 1, . . . , N,
where
Qnm = tn · tm · k(xn, xm).
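In code, Q is just the Gram matrix scaled elementwise by the targets; a short NumPy sketch (my own, with a Gaussian kernel and made-up data):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.45):
    # Pairwise k(xn, xm) = exp(-gamma * ||xn - xm||^2).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

X = np.random.randn(6, 2)
t = np.array([1, 1, 1, -1, -1, -1])

K = rbf_kernel_matrix(X)
Q = np.outer(t, t) * K        # Q_nm = t_n * t_m * k(x_n, x_m)
print(Q.shape)                # (6, 6)
```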
Maximum Margin Classifiers – Prediction
Predict via
y(x) = wᵀφ(x) + b = ∑_{n=1}^N an tn k(x, xn) + b.
The Karush-Kuhn-Tucker (KKT) conditions are
an ≥ 0
tn y(xn) − 1 ≥ 0
an { tn y(xn) − 1 } = 0.
Therefore, either an = 0 or tn y(xn) − 1 = 0.
If an = 0, the point xn makes no contribution to the prediction!
After training, use only the set S of points for which an > 0 (and therefore tn y(xn) = 1) holds.
S contains only the support vectors.
Maximum Margin Classifiers – Support Vectors
S contains only the support vectors.
Decision and margin boundaries for a two-class problem using
Gaussian kernel functions.
Maximum Margin Classifiers – Support Vectors
Now for the bias b.
Observe that for the support vectors, tn y(xn) = 1.
Therefore, we find
tn ( ∑_{m∈S} am tm k(xn, xm) + b ) = 1
and can use any support vector to calculate b.
Numerically more stable: multiply by tn, observe tn² = 1, and average over all support vectors:
b = (1/NS) ∑_{n∈S} ( tn − ∑_{m∈S} am tm k(xn, xm) ).
SVMs: Summary
Maximise the dual objective
L̃(a) = 1ᵀa − (1/2) aᵀQa
subject to
aᵀt = 0
an ≥ 0,   n = 1, . . . , N.
Make predictions via
y(x) = ∑_{n=1}^N an tn k(x, xn) + b.
Dependence only on data points with an > 0.
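Putting the summary together, here is a self-contained from-scratch sketch (my own illustration; the toy data, the Gaussian kernel, and the use of scipy.optimize.minimize are all assumptions, and real SVM solvers use specialised algorithms such as SMO rather than a generic optimiser):

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, gamma=0.45):
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

# Tiny two-class toy data set.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(+2, 1, (10, 2))])
t = np.hstack([-np.ones(10), np.ones(10)])

K = rbf(X, X)
Q = np.outer(t, t) * K
N = len(t)

# Maximise L~(a) = 1^T a - (1/2) a^T Q a  <=>  minimise its negative.
dual = lambda a: 0.5 * a @ Q @ a - a.sum()
grad = lambda a: Q @ a - np.ones(N)
cons = [{'type': 'eq', 'fun': lambda a: a @ t, 'jac': lambda a: t}]   # a^T t = 0
bounds = [(0.0, None)] * N                                            # a_n >= 0

res = minimize(dual, np.zeros(N), jac=grad, bounds=bounds,
               constraints=cons, method='SLSQP')
a = res.x

S = a > 1e-6                      # support vectors: a_n > 0
# Bias: average of t_n - sum_{m in S} a_m t_m k(x_n, x_m) over n in S.
b = np.mean(t[S] - K[S][:, S] @ (a[S] * t[S]))

def predict(Xnew):
    return np.sign(rbf(Xnew, X[S]) @ (a[S] * t[S]) + b)

print('support vectors:', int(S.sum()), 'of', N)
print('training accuracy:', np.mean(predict(X) == t))
```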
Soft Margin SVMs: Non-Separable Case
Non-Separable Data
What if the training data in the feature space are not
linearly separable?
Can we find a separator that makes the fewest possible errors?
Non-Separable Data
What if the training data in the feature space are not
linearly separable?
Allow some data points to be on the 'wrong side' of the decision boundary.
Introduce a penalty which increases with the distance from the decision boundary.
Assume the penalty increases linearly with that distance.
Introduce a slack variable ξn ≥ 0 for each data point n:
ξn = 0: the data point is correctly classified, on the margin boundary or beyond
0 < ξn < 1: the data point lies inside the margin, but on the correct side of the decision boundary
ξn = 1: the data point is on the decision boundary
ξn > 1: the data point is misclassified
Non-Separable Data
The constraints of the separable classification
tn y(xn) ≥ 1,   n = 1, . . . , N
now change to
tn y(xn) ≥ 1 − ξn,   n = 1, . . . , N.
In sum, we have the constraints
ξn ≥ 0
ξn ≥ 1 − tn y(xn).
Non-Separable Data
Find now
min_{w,ξ} C ∑_{n=1}^N ξn + (1/2) ‖w‖²   s.t.   ξn ≥ 0,   ξn ≥ 1 − tn y(xn),
where C controls the trade-off between the slack variable penalty and the margin.
Minimise the total slack associated with all data points.
Any misclassified point contributes ξn > 1, therefore ∑_{n=1}^N ξn is an upper bound on the number of misclassified points.
Non-Separable Data
The Lagrangian with the Lagrange multipliers a = (a1, . . . , aN)ᵀ and µ = (µ1, . . . , µN)ᵀ is now
L(w, b, ξ, a, µ) = (1/2) ‖w‖² + C ∑_{n=1}^N ξn − ∑_{n=1}^N an { tn y(xn) − 1 + ξn } − ∑_{n=1}^N µn ξn.
The KKT conditions for n = 1, . . . , N are
an ≥ 0
tn y(xn) − 1 + ξn ≥ 0
an (tn y(xn) − 1 + ξn) = 0
µn ≥ 0
ξn ≥ 0
µn ξn = 0.
Non-Separable Data
The dual Lagrangian after eliminating w, b, ξ, and µ is then
L̃(a) = ∑_{n=1}^N an − (1/2) ∑_{n=1}^N ∑_{m=1}^N an am tn tm k(xn, xm)
subject to
∑_{n=1}^N an tn = 0
0 ≤ an ≤ C,   n = 1, . . . , N.
The only change from the separable case is the box constraint on the an via the parameter C.
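In practice one rarely solves this QP by hand; a minimal scikit-learn sketch (assuming scikit-learn is installed; the overlapping toy data are made up) shows the soft-margin SVM, the role of C, and the sparsity in the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])      # overlapping classes

# Soft-margin SVM with Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2).
clf = SVC(C=1.0, kernel='rbf', gamma=0.45).fit(X, t)

print('number of support vectors:', len(clf.support_))
print('training accuracy:', clf.score(X, t))
# Smaller C allows more slack (a wider margin), typically giving more support vectors.
```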
Non-Separable Data
Equivalent formulation by Schölkopf et al. (2000):
L̃(a) = − (1/2) ∑_{n=1}^N ∑_{m=1}^N an am tn tm k(xn, xm)
subject to
∑_{n=1}^N an tn = 0
0 ≤ an ≤ 1/N,   n = 1, . . . , N
∑_{n=1}^N an ≥ ν,
where ν is both
1 an upper bound on the fraction of margin errors, and
2 a lower bound on the fraction of support vectors.
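This ν-parameterisation is also available off the shelf; a short sketch (my own, assuming scikit-learn and made-up data):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

clf = NuSVC(nu=0.2, kernel='rbf', gamma=0.45).fit(X, t)
frac_sv = len(clf.support_) / len(t)
print('fraction of support vectors:', frac_sv)   # at least nu = 0.2
```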
Non-Separable Data
[Figure: the ν-SVM algorithm using Gaussian kernels exp(−γ‖x − x′‖²) with γ = 0.45 applied to a nonseparable data set in two dimensions; support vectors are indicated by circles.]
Support Vector Machines – Limitations
Outputs are decisions, not posterior probabilities.
Extension to classification with more than two classes is
problematic.
There is a complexity parameter C (or ν) which must be
found (e.g. via cross-validation).
Predictions are expressed as linear combinations of kernel
functions that are centered on the training points.
The kernel matrix is required to be positive definite.
Optimization View of Machine Learning
Recall from decision theory that we want to minimise the expected loss, e.g. the probability of a mistake:
p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx.
For finite data, we minimise the negative log likelihood, appropriately regularised.
More generally, we minimise the empirical risk (defined by a loss function) plus a regulariser:
Remp(w, X, t) + Ω(w) = ∑_{n=1}^N ℓ(xn, tn; w) + Ω(w),
where ℓ(xn, tn; w) is the loss per example and Ω(w) is the regulariser.
Logistic Regression Loss Function?
Consider the labels tn ∈ {−1, +1}.
Recall the definition of the sigmoid σ(·); then
p(tn = +1 | xn) = σ(wᵀφ(xn)) = 1 / (1 + exp(−wᵀφ(xn)))
p(tn = −1 | xn) = 1 − σ(wᵀφ(xn)) = 1 / (1 + exp(+wᵀφ(xn))),
which can be compactly written as
p(tn | xn) = σ(tn wᵀφ(xn)) = 1 / (1 + exp(−tn wᵀφ(xn))).
Assuming independence, the negative log likelihood is
∑_{n=1}^N log ( 1 + exp(−tn wᵀφ(xn)) ).
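A small NumPy sketch of this ±1-label form of the logistic loss (my own illustration; Phi is a hypothetical design matrix whose rows are the feature vectors φ(xn)):

```python
import numpy as np

def logistic_nll(w, Phi, t):
    # t in {-1, +1}; Phi is the N x D matrix with rows phi(x_n).
    margins = t * (Phi @ w)              # t_n * w^T phi(x_n)
    return np.sum(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))            # toy features
t = np.array([1, -1, 1, 1, -1])
w = np.zeros(3)
print(logistic_nll(w, Phi, t))           # equals 5 * log(2) when w = 0
```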
SVM Loss Function?
Soft margin SVM
min_{w,ξ} C ∑_{n=1}^N ξn + (1/2) ‖w‖²   s.t.   ξn ≥ 0,   ξn ≥ 1 − tn y(xn).
Observe the equivalent unconstrained loss: for a fixed y,
min_ξ ξ   s.t.   ξ ≥ 0,   ξ ≥ 1 − y
is equal to
max{0, 1 − y}.
We denote the hinge loss by [1 − y]+ = max{0, 1 − y}.
Loss functions
Linear Regression (squared loss)
(1/2) ∑_{n=1}^N { tn − wᵀφ(xn) }² + (λ/2) ∑_{d=1}^D wd²  =  (1/2) (t − Φw)ᵀ(t − Φw) + (λ/2) wᵀw
Logistic Regression (log loss)
− ln p(t | w) + (λ/2) wᵀw  =  ∑_{n=1}^N log ( 1 + exp(−tn wᵀφ(xn)) ) + (λ/2) wᵀw
Support Vector Machine (hinge loss)
C ∑_{n=1}^N [ 1 − tn wᵀφ(xn) ]+ + (1/2) wᵀw
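For intuition, the per-example losses can be compared as functions of the margin z = tn wᵀφ(xn); a small NumPy sketch (my own, regularisers omitted; the squared loss is rewritten in margin form using tn² = 1):

```python
import numpy as np

def squared_loss(z):
    # (t - y)^2 = (1 - t*y)^2 = (1 - z)^2 since t^2 = 1
    return (1.0 - z) ** 2

def log_loss(z):
    return np.log1p(np.exp(-z))        # log(1 + exp(-z))

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)    # [1 - z]_+

z = np.linspace(-2, 2, 5)
for name, loss in [('squared', squared_loss), ('log', log_loss), ('hinge', hinge_loss)]:
    print(name, np.round(loss(z), 3))
# All three penalise z < 1; the hinge loss is exactly zero for z >= 1.
```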
Loss functions
[Figure: loss E(z) as a function of z: hinge loss in blue, cross-entropy loss in red.]
Further Reading (Not Assessable)
Our derivation of the SVM
provides a nice example of Lagrangian optimisation
yields a neat dual formulation
is of historical value
It is, however, possible to derive an SVM algorithm much more simply; see e.g.
Olivier Chapelle, "Training a Support Vector Machine in the Primal", Neural Computation, 2007, MIT Press.