Statistical Machine Learning

© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outline

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XIV

Sparse Kernel Machines


Sparse Kernel Machines – Motivation

Nonlinear kernels extended our toolbox of methods
considerably

Wherever an inner product was used in an algorithm, we
can replace it with a kernel.
Kernels act as a kind of ’similarity’ measure and can be
defined over graphs, sets, strings, and documents.

But the kernel matrix is a square matrix with dimensions
equal to the number of data points N

In order to calculate it, the kernel function must be
evaluated for all pairs of training inputs.

Sparse Kernel Machines implement learning algorithms
where, for prediction, the kernels are only evaluated at a
subset of the training data (illustrated in the sketch below). This
partially alleviates the time complexity problem, but in general
approximations are still needed for a scalable algorithm.
Today we introduce the famous Support Vector Machine
— a non-probabilistic, non-parametric classifier, related to
Kernel Logistic Regression.
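The sketch below (numpy assumed; all names are illustrative, not course code) contrasts the full N × N Gram matrix needed during training with prediction that touches only a subset of the training points.

```python
# Sketch: the full Gram matrix needs all N^2 kernel evaluations,
# while a sparse kernel machine predicts using only a subset S.
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, gamma=1.0):
    """K[n, m] = k(x_n, x_m): N^2 kernel evaluations."""
    N = len(X)
    return np.array([[gaussian_kernel(X[n], X[m], gamma) for m in range(N)]
                     for n in range(N)])

def sparse_predict(x_new, X, coeffs, subset, b=0.0, gamma=1.0):
    """Prediction uses only the points indexed by `subset` (e.g. support vectors)."""
    return np.sign(sum(coeffs[n] * gaussian_kernel(x_new, X[n], gamma)
                       for n in subset) + b)
```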


Maximum Margin Classifiers

Return to two-class classification with a linear model

\[ y(x) = w^\top \phi(x) + b \]

where φ(x) is some fixed feature space mapping, and b is
the bias.
Training data are N input vectors x1, . . . , xN with
corresponding targets t1, . . . , tN, where tn ∈ {−1, +1}.
The class of a new point is predicted as sign(y(x)).
Assume there exists a linear decision boundary. That is,
there exist w and b such that

\[ t_n \, \mathrm{sign}(y(x_n)) > 0, \qquad n = 1, \dots, N. \]


The Multiple Separator problem

There may exist many solutions w and b for which the
linear classifier perfectly separates the two classes!


Maximum Margin Classifiers

There may exist many solutions w and b for which the
linear classifier perfectly separates the two classes.
The perceptron algorithm can find a solution, but this
depends on the initial choice of parameters.
What is the decision boundary which results in the best
generalisation (smallest generalisation error)?


Maximum Margin Classifiers

The margin is the smallest distance between the decision
boundary and any of the samples. (We will see later why
y = ±1 on the margin.)

[Figure: decision boundary y = 0 with margin boundaries y = 1 and y = −1; the margin is the distance between the boundary and the closest samples.]


Maximum Margin Classifiers

Support Vector Machines (SVMs) choose the decision
boundary which maximises the margin.

[Figure: decision boundaries y = 0 with margin boundaries y = ±1; the SVM chooses the boundary that maximises the margin.]


Reminder – Linear Classification

For a linear model with weights w and bias b, the decision
boundary is {x : wᵀx + b = 0}.
Any x has a projection x_proj on the boundary, and

\[ x - x_{\mathrm{proj}} = \lambda \, w \]

since w is orthogonal to the decision boundary.
But we also have (using wᵀx_proj = −b, since x_proj lies on the
boundary)

\[ w^\top x - w^\top x_{\mathrm{proj}} = w^\top x + b = \lambda \, w^\top w \]

[Figure: a point x, its projection x_proj onto the hyperplane wᵀz + b = 0, and the origin 0.]


Reminder – Linear Classification

So,

\[ \| x - x_{\mathrm{proj}} \| = |\lambda| \cdot \|w\| = \frac{|w^\top x + b|}{\|w\|} \]

The distance of x from the decision boundary is therefore

\[ \frac{|w^\top x + b|}{\|w\|} = \frac{|y(x)|}{\|w\|}. \]

[Figure: a point x, its projection x_proj onto the hyperplane wᵀz + b = 0, and the origin 0.]


Maximum Margin Classifiers

Support Vector Machines choose the decision boundary
which maximises the smallest distance to samples in both
classes.
We calculated the distance of a point x from the
hyperplane y(x) = 0 as |y(x)|/‖w‖.
Perfect classification means tn y(xn) > 0 for all n.
Thus, the distance of xn from the decision boundary is

\[ \frac{t_n \, y(x_n)}{\|w\|} = \frac{t_n \, (w^\top \phi(x_n) + b)}{\|w\|} \]

[Figure: maximum margin decision boundary y = 0 with margin boundaries y = ±1.]


Maximum Margin Classifiers

Support Vector Machines choose the decision boundary
which maximises the smallest distance to samples in both
classes.
For the maximum margin solution, solve

\[ \arg\max_{w,\,b} \left\{ \frac{1}{\|w\|} \, \min_n \left[ t_n \, (w^\top \phi(x_n) + b) \right] \right\}. \]

How can we solve this?


Maximum Margin Classifiers

Maximum margin solution

\[ \arg\max_{w,\,b} \left\{ \frac{1}{\|w\|} \, \min_n \left[ t_n \, (w^\top \phi(x_n) + b) \right] \right\} \]

or

\[ \arg\max_{w,\,b,\,\gamma} \left\{ \frac{\gamma}{\|w\|} \right\} \quad \text{s.t.} \quad t_n \, (w^\top \phi(x_n) + b) \ge \gamma, \quad n = 1, \dots, N. \]

By rescaling any (w, b) by 1/γ, we don’t change the
answer!
We can assume that γ = 1.
The canonical representation for the decision hyperplane
is therefore

\[ t_n \, (w^\top \phi(x_n) + b) \ge 1, \quad n = 1, \dots, N. \]


Maximum Margin Classifiers

Maximum margin solution

\[ \arg\max_{w,\,b} \left\{ \frac{1}{\|w\|} \, \min_n \left[ t_n \, (w^\top \phi(x_n) + b) \right] \right\} \]

Transformed once

\[ \arg\max_{w,\,b} \left\{ \frac{1}{\|w\|} \right\} \quad \text{s.t.} \quad t_n \, (w^\top \phi(x_n) + b) \ge 1. \]

Transformed again

\[ \arg\min_{w,\,b} \left\{ \frac{1}{2} \|w\|^2 \right\} \quad \text{s.t.} \quad t_n \, (w^\top \phi(x_n) + b) \ge 1. \]
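The final form is a standard quadratic program. As an illustration only, the sketch below hands it to a generic convex solver; cvxpy, the toy data, and the name hard_margin_svm are assumptions of this example, not part of the course material.

```python
# Minimal sketch: solve the primal hard-margin SVM QP with cvxpy
# (an assumed dependency; any QP solver would do).
import numpy as np
import cvxpy as cp

def hard_margin_svm(Phi, t):
    """Phi: (N, D) feature matrix phi(x_n); t: (N,) labels in {-1, +1}."""
    N, D = Phi.shape
    w = cp.Variable(D)
    b = cp.Variable()
    # minimise (1/2)||w||^2 subject to t_n (w^T phi(x_n) + b) >= 1
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(t, Phi @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Toy usage on separable 2-D data (identity feature map)
Phi = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(Phi, t)
print(w, b)
```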


Lagrange Multipliers


Lagrange Multipliers

Find the stationary points for a function f (x1, x2) subject to
one or more constraints on the variables x1 and x2 written
in the form g(x1, x2) = 0.
Direct approach

1 Solve g(x1, x2) = 0 for one of the variables to get x2 = h(x1).
2 Insert the result into f(x1, x2) to get a function of one variable,
f(x1, h(x1)).
3 Find the stationary point(s) x1⋆ of f(x1, h(x1)), with
corresponding value x2⋆ = h(x1⋆).

Finding x2 = h(x1) may be hard.
Symmetry in the variables x1 and x2 is lost.


Lagrange Multipliers

To solve

\[ \max_x \; f(x) \quad \text{s.t.} \quad g(x) = 0 \]

introduce the Lagrangian function

\[ L(x, \lambda) = f(x) + \lambda \, g(x) \]

from which we get the constrained stationarity conditions

\[ \nabla_x L(x, \lambda) = \nabla f(x) + \lambda \nabla g(x) = 0 \]

and the constraint itself

\[ \frac{\partial L(x, \lambda)}{\partial \lambda} = g(x) = 0. \]

These are D equations resulting from ∇x L(x, λ) and one
equation from ∂L(x, λ)/∂λ, together determining x⋆ and λ.


Lagrange Multipliers – Example

Given f(x1, x2) = 1 − x1² − x2², subject to the constraint
g(x1, x2) = x1 + x2 − 1 = 0.
Define the Lagrangian function

\[ L(x, \lambda) = 1 - x_1^2 - x_2^2 + \lambda (x_1 + x_2 - 1). \]

A stationary solution with respect to x1, x2, and λ must
satisfy

\[ -2 x_1 + \lambda = 0, \qquad -2 x_2 + \lambda = 0, \qquad x_1 + x_2 - 1 = 0. \]

Therefore (x1⋆, x2⋆) = (1/2, 1/2) and λ = 1.
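A quick way to check such stationarity conditions is to let a computer algebra system do the differentiation; the sketch below (assuming SymPy is available) reproduces the example above.

```python
# Sketch: verify the Lagrange multiplier example with SymPy.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g                      # Lagrangian L(x, lambda)

# Stationarity in x1, x2 and the constraint dL/dlam = g = 0
eqs = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(eqs, (x1, x2, lam)))  # {x1: 1/2, x2: 1/2, lam: 1}
```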


Lagrange Multipliers – Example

Given f(x1, x2) = 1 − x1² − x2², subject to the constraint
g(x1, x2) = x1 + x2 − 1 = 0.
Lagrangian L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1).
Solution (x1⋆, x2⋆) = (1/2, 1/2).

[Figure: contours of f(x1, x2) in the (x1, x2) plane, the constraint line g(x1, x2) = 0, and the constrained stationary point (x1⋆, x2⋆).]


Lagrange Multipliers for Inequality Constraints

To solve

\[ \max_x \; f(x) \quad \text{s.t.} \quad g(x) \ge 0 \]

define the Lagrangian

\[ L(x, \lambda) = f(x) + \lambda \, g(x) \]

Solve for x and λ subject to the constraints
(Karush-Kuhn-Tucker or KKT conditions)

\[ g(x) \ge 0, \qquad \lambda \ge 0, \qquad \lambda \, g(x) = 0. \]


Lagrange Multipliers for General Case

Maximise f(x) subject to the constraints gj(x) = 0 for
j = 1, . . . , J, and hk(x) ≥ 0 for k = 1, . . . , K.
Define the Lagrange multipliers {λj} and {µk}, and the
Lagrangian

\[ L(x, \{\lambda_j\}, \{\mu_k\}) = f(x) + \sum_{j=1}^{J} \lambda_j \, g_j(x) + \sum_{k=1}^{K} \mu_k \, h_k(x). \]

Solve for x, {λj}, and {µk} subject to the constraints
(Karush-Kuhn-Tucker or KKT conditions)

\[ \mu_k \ge 0, \qquad \mu_k \, h_k(x) = 0, \qquad k = 1, \dots, K. \]

For minimisation of f(x), change the sign in front of the
Lagrange multipliers.


SVMs and Maximum Margin
Classifiers: Redux


Maximum Margin Classifiers

Quadratic Programming (QP) problem: minimise a
quadratic function subject to linear constraints.

\[ \arg\min_{w,\,b} \left\{ \frac{1}{2} \|w\|^2 \right\} \quad \text{s.t.} \quad t_n \, (w^\top \phi(x_n) + b) \ge 1. \]

Introduce Lagrange multipliers {an ≥ 0} to get the
Lagrangian

\[ L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n \, (w^\top \phi(x_n) + b) - 1 \right\} \]

(note the negative sign in front of the multipliers, since this is
a minimisation problem).


Maximum Margin Classifiers

Lagrangian

\[ L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n \, (w^\top \phi(x_n) + b) - 1 \right\}. \]

Setting the derivatives with respect to w and b to zero gives

\[ w = \sum_{n=1}^{N} a_n t_n \phi(x_n), \qquad 0 = \sum_{n=1}^{N} a_n t_n. \]

Dual representation (again a QP problem): maximise

\[ \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(x_n, x_m) \]

subject to

\[ \sum_{n=1}^{N} a_n t_n = 0, \qquad a_n \ge 0, \quad n = 1, \dots, N. \]


Maximum Margin Classifiers

Dual representation (again a QP problem): maximise

\[ \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(x_n, x_m) \]

subject to

\[ \sum_{n=1}^{N} a_n t_n = 0, \qquad a_n \ge 0, \quad n = 1, \dots, N. \]

Vector form: maximise

\[ \tilde{L}(a) = \mathbf{1}^\top a - \frac{1}{2} a^\top Q a \]

subject to

\[ a^\top t = 0, \qquad a_n \ge 0, \quad n = 1, \dots, N, \]

where

\[ Q_{nm} = t_n \, t_m \, k(x_n, x_m). \]
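The vector form can be handed directly to a generic convex solver. The sketch below is illustrative only: cvxpy, the Gaussian kernel, and the helper names rbf_kernel / svm_dual are assumptions of this example (cvxpy's psd_wrap simply asserts that Q is positive semidefinite, which holds when k is a valid kernel).

```python
# Sketch: solve the hard-margin dual QP  max 1^T a - 0.5 a^T Q a
# s.t. a^T t = 0, a >= 0, using cvxpy (an assumed dependency).
import numpy as np
import cvxpy as cp

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svm_dual(X, t, gamma=1.0):
    """X: (N, d) inputs, t: (N,) labels in {-1, +1}. Returns the dual variables a."""
    K = rbf_kernel(X, X, gamma)
    Q = np.outer(t, t) * K                       # Q_nm = t_n t_m k(x_n, x_m)
    a = cp.Variable(len(t))
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, cp.psd_wrap(Q)))
    constraints = [a >= 0, a @ t == 0]           # a_n >= 0 and sum_n a_n t_n = 0
    cp.Problem(objective, constraints).solve()
    return a.value
```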


Maximum Margin Classifiers – Prediction

Predict via

\[ y(x) = w^\top \phi(x) + b = \sum_{n=1}^{N} a_n t_n \, k(x, x_n) + b \]

The Karush-Kuhn-Tucker (KKT) conditions are

\[ a_n \ge 0, \qquad t_n \, y(x_n) - 1 \ge 0, \qquad a_n \{ t_n \, y(x_n) - 1 \} = 0. \]

Therefore, either an = 0 or tn y(xn) − 1 = 0.
If an = 0, no contribution to the prediction!
After training, use only the set S of points for which an > 0
(and therefore tn y(xn) = 1) holds.
S contains only the support vectors.


Maximum Margin Classifiers – Support Vectors

S contains only the support vectors.

[Figure: decision and margin boundaries for a two-class problem using Gaussian kernel functions.]


Maximum Margin Classifiers – Support Vectors

Now for the bias b.
Observe that for the support vectors, tn y(xn) = 1.
Therefore, we find

\[ t_n \left( \sum_{m \in S} a_m t_m \, k(x_n, x_m) + b \right) = 1 \]

and can use any support vector to calculate b.
Numerically more stable: multiply by tn, observe that tn² = 1,
and average over all NS support vectors:

\[ b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m \, k(x_n, x_m) \right) \]
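Continuing the earlier dual sketch (the helpers rbf_kernel and svm_dual are hypothetical names from that sketch, not course code), the bias and the prediction function can be assembled from the support vectors alone:

```python
# Sketch (continuing the dual example above): bias and prediction
# from the support vectors only.
import numpy as np

def fit_bias_and_predict(X, t, a, gamma=1.0, tol=1e-6):
    K = rbf_kernel(X, X, gamma)                    # hypothetical helper from the earlier sketch
    S = a > tol                                    # indices of the support vectors
    # b = (1/N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    b = np.mean(t[S] - K[np.ix_(S, S)] @ (a[S] * t[S]))

    def predict(X_new):
        # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b
        K_new = rbf_kernel(X_new, X[S], gamma)
        return np.sign(K_new @ (a[S] * t[S]) + b)

    return b, predict
```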


SVMs: Summary

Maximise the dual objective

\[ \tilde{L}(a) = \mathbf{1}^\top a - \frac{1}{2} a^\top Q a \quad \text{s.t.} \quad a^\top t = 0, \quad a_n \ge 0, \quad n = 1, \dots, N. \]

Make predictions via

\[ y(x) = \sum_{n=1}^{N} a_n t_n \, k(x, x_n) + b \]

Dependence only on data points with an > 0.


Soft Margin SVMs: Non-Separable
Case


Non-Separable Data

What if the training data in the feature space are not
linearly separable?

Can we find a separator that makes the fewest possible errors?


Non-Separable Data

What if the training data in the feature space are not
linearly separable?

Allow some data points to be on the 'wrong side' of the
decision boundary, but incur a penalty that increases with
their distance from the decision boundary.
Assume the penalty increases linearly with distance.
Introduce a slack variable ξn ≥ 0 for each data point n:

ξn = 0: data point is correctly classified, on the margin boundary or beyond
0 < ξn < 1: data point is inside the margin, but on the correct side of the decision boundary
ξn = 1: data point is on the decision boundary
ξn > 1: data point is misclassified

(A small numerical illustration follows below.)

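A minimal sketch of how the slack values fall out of a fitted model, assuming some weight vector w, bias b, and feature matrix Phi are already available (purely illustrative names):

```python
# Sketch: slack variables xi_n = max(0, 1 - t_n * y(x_n)) for a given model.
import numpy as np

def slacks(Phi, t, w, b):
    """Phi: (N, D) features, t: (N,) labels in {-1, +1}, (w, b): a fitted model."""
    y = Phi @ w + b
    xi = np.maximum(0.0, 1.0 - t * y)
    return xi

# Interpretation of each slack value:
#   xi == 0      -> correct, on or beyond the margin boundary
#   0 < xi < 1   -> inside the margin, correct side
#   xi == 1      -> exactly on the decision boundary
#   xi > 1       -> misclassified
```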

Non-Separable Data

The constraints of the separable classification,

\[ t_n \, y(x_n) \ge 1, \qquad n = 1, \dots, N, \]

now change to

\[ t_n \, y(x_n) \ge 1 - \xi_n, \qquad n = 1, \dots, N. \]

In sum, we have the constraints

\[ \xi_n \ge 0, \qquad \xi_n \ge 1 - t_n \, y(x_n). \]


Non-Separable Data

Now solve

\[ \min_{w,\,\xi} \; C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \xi_n \ge 0, \quad \xi_n \ge 1 - t_n \, y(x_n), \]

where C controls the trade-off between the slack variable
penalty and the margin.
We minimise the total slack associated with all data points.
Any misclassified point contributes ξn > 1; therefore the sum
of the ξn is an upper bound on the number of misclassified
points.


Non-Separable Data

The Lagrangian with the Lagrange multipliers
a = (a1, . . . , aN)ᵀ and µ = (µ1, . . . , µN)ᵀ is now

\[ L(w, b, \xi, a, \mu) = \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \{ t_n \, y(x_n) - 1 + \xi_n \} - \sum_{n=1}^{N} \mu_n \xi_n. \]

The KKT conditions for n = 1, . . . , N are

\[ a_n \ge 0, \qquad t_n \, y(x_n) - 1 + \xi_n \ge 0, \qquad a_n ( t_n \, y(x_n) - 1 + \xi_n ) = 0, \]
\[ \mu_n \ge 0, \qquad \xi_n \ge 0, \qquad \mu_n \xi_n = 0. \]


Non-Separable Data

The dual Lagrangian after eliminating w, b, ξ, and µ is then

\[ \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(x_n, x_m) \]

subject to

\[ \sum_{n=1}^{N} a_n t_n = 0, \qquad 0 \le a_n \le C, \quad n = 1, \dots, N. \]

The only change from the separable case is the box
constraint on an via the parameter C.
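In code, this is a one-constraint change relative to the earlier hard-margin dual sketch; the function below is again illustrative only and reuses the hypothetical rbf_kernel helper.

```python
# Sketch: soft-margin dual, identical to the hard-margin sketch
# except for the box constraint 0 <= a_n <= C.
import numpy as np
import cvxpy as cp

def svm_dual_soft(X, t, C=1.0, gamma=1.0):
    K = rbf_kernel(X, X, gamma)                  # hypothetical helper from the earlier sketch
    Q = np.outer(t, t) * K
    a = cp.Variable(len(t))
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, cp.psd_wrap(Q)))
    constraints = [a >= 0, a <= C, a @ t == 0]   # the only change: a_n <= C
    cp.Problem(objective, constraints).solve()
    return a.value
```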


Non-Separable Data

Equivalent formulation by Schölkopf et al. (2000): maximise

\[ \tilde{L}(a) = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(x_n, x_m) \]

subject to

\[ \sum_{n=1}^{N} a_n t_n = 0, \qquad 0 \le a_n \le \frac{1}{N}, \quad n = 1, \dots, N, \qquad \sum_{n=1}^{N} a_n \ge \nu, \]

where ν is both
1 an upper bound on the fraction of margin errors, and
2 a lower bound on the fraction of support vectors.


Non-Separable Data

[Figure: the ν-SVM algorithm using Gaussian kernels exp(−γ‖x − x′‖²)
with γ = 0.45 applied to a nonseparable data set in two
dimensions; support vectors are indicated by circles.]
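For reference, a similar picture can be produced with an off-the-shelf ν-SVM implementation. The sketch below assumes scikit-learn is available and uses made-up data (two overlapping blobs), not the data set from the figure.

```python
# Sketch: nu-SVM with a Gaussian (RBF) kernel via scikit-learn (assumed dependency).
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
# Made-up, nonseparable 2-D data: two overlapping Gaussian blobs.
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
               rng.normal(+1.0, 1.2, size=(50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

clf = NuSVC(nu=0.2, kernel='rbf', gamma=0.45)   # gamma as in the slide's figure
clf.fit(X, t)
print("fraction of support vectors:", len(clf.support_) / len(t))
print("training accuracy:", clf.score(X, t))
```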


Support Vector Machines – Limitations

Outputs are decisions, not posterior probabilities.
Extension to classification with more than two classes is
problematic.
There is a complexity parameter C (or ν) which must be
found (e.g. via cross-validation).
Predictions are expressed as linear combinations of kernel
functions that are centered on the training points.
The kernel matrix is required to be positive definite.


Optimization View of Machine Learning

Recall from decision theory that we want to minimise the
expected loss, e.g. the probability of a mistake

\[ p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1)
   = \int_{R_1} p(x, C_2) \, dx + \int_{R_2} p(x, C_1) \, dx \]

For finite data, we minimise the negative log likelihood,
appropriately regularised.
More generally, we minimise the empirical risk (defined by
a loss function) plus a regulariser:

\[ R_{\mathrm{emp}}(w, X, t) + \Omega(w) = \underbrace{\sum_{n=1}^{N} \ell(x_n, t_n; w)}_{\text{loss per example}} + \underbrace{\Omega(w)}_{\text{regulariser}} \]
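As a purely illustrative sketch of this template, the regularised empirical risk of a linear model can be written in a few lines, with the per-example loss as the pluggable piece (all names below are assumptions of the example):

```python
# Sketch: regularised empirical risk  sum_n loss(x_n, t_n; w) + Omega(w)
# for a linear model y(x) = w^T phi(x) + b, with a pluggable loss.
import numpy as np

def hinge(y, t):           # SVM loss per example: [1 - t y]_+
    return np.maximum(0.0, 1.0 - t * y)

def log_loss(y, t):        # logistic regression loss per example
    return np.log1p(np.exp(-t * y))

def regularised_risk(Phi, t, w, b, loss=hinge, lam=1.0):
    y = Phi @ w + b
    return loss(y, t).sum() + 0.5 * lam * w @ w   # Omega(w) = (lam/2) ||w||^2
```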


Logistic Regression Loss Function?

Consider the labels tn ∈ {−1, +1}.
Recall the definition of the sigmoid σ(·); then

\[ p(t_n = 1 \mid x_n) = \sigma(w^\top \phi(x_n)) = \frac{1}{1 + \exp(-w^\top \phi(x_n))} \]
\[ p(t_n = -1 \mid x_n) = 1 - \sigma(w^\top \phi(x_n)) = \frac{1}{1 + \exp(+w^\top \phi(x_n))} \]

which can be compactly written as

\[ p(t_n \mid x_n) = \sigma(t_n \, w^\top \phi(x_n)) = \frac{1}{1 + \exp(-t_n \, w^\top \phi(x_n))} \]

Assuming independence, the negative log likelihood is

\[ \sum_{n=1}^{N} \log \left( 1 + \exp(-t_n \, w^\top \phi(x_n)) \right) \]


SVM Loss Function?

Soft margin SVM:

\[ \min_{w,\,\xi} \; C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \xi_n \ge 0, \quad \xi_n \ge 1 - t_n \, y(x_n). \]

Observe that the constrained problem

\[ \min_{\xi,\,y} \; \xi \quad \text{s.t.} \quad \xi \ge 0, \quad \xi \ge 1 - y \]

is equivalent to the unconstrained problem

\[ \min_{y} \; \max\{0, 1 - y\}. \]

We denote the hinge loss by [1 − y]+ = max{0, 1 − y}.


Loss functions

Linear Regression (squared loss):

\[ \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^\top \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{d=1}^{D} w_d^2
   = \frac{1}{2} (\mathbf{t} - \Phi w)^\top (\mathbf{t} - \Phi w) + \frac{\lambda}{2} w^\top w \]

Logistic Regression (log loss):

\[ -\ln p(\mathbf{t} \mid w) + \frac{\lambda}{2} w^\top w
   = \sum_{n=1}^{N} \log \left( 1 + \exp(-t_n \, w^\top \phi(x_n)) \right) + \frac{\lambda}{2} w^\top w \]

Support Vector Machine (hinge loss):

\[ C \sum_{n=1}^{N} \left[ 1 - t_n \, w^\top \phi(x_n) \right]_+ + \frac{1}{2} w^\top w \]


Loss functions

[Figure: loss E(z) as a function of z for z ∈ [−2, 2]; hinge loss in blue, cross-entropy loss in red.]
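A short sketch to reproduce such a comparison (matplotlib assumed; the 1/ln 2 rescaling of the cross-entropy curve, so that it passes through E(0) = 1, is an assumption about how the red curve is drawn):

```python
# Sketch: hinge loss vs (rescaled) cross-entropy loss as functions of z.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-2.0, 2.0, 400)
hinge = np.maximum(0.0, 1.0 - z)                    # [1 - z]_+
log_loss = np.log1p(np.exp(-z)) / np.log(2.0)       # rescaled so that E(0) = 1

plt.plot(z, hinge, 'b', label='hinge loss')
plt.plot(z, log_loss, 'r', label='cross entropy (rescaled)')
plt.xlabel('z')
plt.ylabel('E(z)')
plt.legend()
plt.show()
```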


Further Reading (Not Assessable)

Our derivation of the SVM
provides a nice example of Lagrangian optimisation
yields a neat dual formulation
is of historical value

It is possible, however, to derive an SVM algorithm much
more simply. See e.g.:

Olivier Chapelle, “Training a Support Vector Machine in the Primal”,
Neural Computation, 2007, MIT Press.
