
Statistical Machine Learning

© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University

Outline

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XIII

Kernel Methods


Original Input versus Feature Space

Basic linear regression models the input x directly.
These models are trivially extended to work on a fixed nonlinear transformation of the inputs, using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centred at the green crosses in the input space.

(Figure: data in the input space (x1, x2), and the same data mapped into the feature space (φ1, φ2) defined by the two Gaussian basis functions.)


Original Input versus Feature Space

Linear decision boundaries in the feature space
correspond to nonlinear decision boundaries in the input
space.
Classes which are NOT linearly separable in the input
space can become linearly separable in the feature space.
BUT: If classes overlap in input space, they will also
overlap in feature space.
Nonlinear features φ(x) cannot remove the overlap.

(Figure: the same data and decision boundary shown in the input space (x1, x2) and in the feature space (φ1, φ2): the boundary is nonlinear in the input space but linear in the feature space.)


Where have we been?

Basis function models (regression, classification)
Flexible basis function models (neural networks)


Where are we going?

Why not use all training data to make predictions for the
test inputs?
Basic ideas:

Continuity: targets mostly do not change abruptly.
Similarity: each training pair (input, target) tells us something about the possible targets in the neighbourhood of that input.

Kernels formalise those ideas.
Nonparametric methods do not rely on a fixed number of parameters, but usually on storing the entire training set (various loose definitions are used here).
This often leads to shallow (cf. deep) learning.


How are we going there?

0) Histograms
1) Simple density estimation kernels (need only be non-negative and integrable)
2) Implicit feature mapping kernels (must be positive definite)

Warning:

The term kernel is highly overloaded.
Even today we consider two different types of kernel:

1 Smoothing kernel / density estimation kernel / Parzen window estimator kernel / Nadaraya-Watson kernel
2 Positive semidefinite kernel / reproducing kernel Hilbert space kernel / implicit feature map kernel / support vector machine kernel / ≈ Gaussian process covariance function or kernel


Kernel Density Estimation


Density Estimation

Suppose we observe data points $\{x_n\}_{n=1}^N$
  e.g. just N real numbers
Suppose we believe these are drawn independently from some distribution p(x)
  e.g. p(x) is Gaussian with unknown mean and variance
Density estimation problem: estimate p(x) from the data


Nonparametric Density Estimation – Histogram

Partition the space into bins of width $\Delta_i$.
Count the number $n_i$ of samples falling into each bin i.
Normalise:

$$ p_i = \frac{n_i}{N \Delta_i} $$

(Figure: histograms of 50 data points generated from the distribution shown by the green curve, for common bin widths $\Delta$ = 0.04, 0.08 and 0.25.)
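As a concrete illustration of the estimate $p_i = n_i / (N \Delta_i)$, here is a minimal sketch in Python/NumPy; the data-generating distribution, the random seed, and the bin counts are illustrative choices, not those used in the figure.

import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=50)             # 50 illustrative data points in [0, 1]

def histogram_density(x, n_bins):
    """Histogram density estimate p_i = n_i / (N * Delta_i) on [0, 1]."""
    counts, edges = np.histogram(x, bins=n_bins, range=(0.0, 1.0))
    widths = np.diff(edges)                 # Delta_i for each bin
    return edges, counts / (len(x) * widths)

for n_bins in (4, 12, 25):                  # roughly Delta = 0.25, 0.08, 0.04
    edges, p = histogram_density(x, n_bins)
    print(n_bins, np.sum(p * np.diff(edges)))   # each estimate integrates to 1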


Nonparametric Density Estimation – Histogram

Advantages:
  Data can be discarded after calculating the $p_i$ (so the method arguably is not truly nonparametric, though it is often described as such).
  The algorithm can be applied to sequentially arriving data.

Disadvantages:
  Dependency on the bin width $\Delta_i$.
  Discontinuities due to the bin edges.
  Exponential scaling with the dimensionality D of the data: we need $M^D$ bins for D dimensions and M bins per dimension.


Nonparametric Density Estimation – Refined

Draw data from some unknown probability distribution p(x) in a D-dimensional space.
Consider a small region $\mathcal{R}$ containing x. The probability mass associated with this region is

$$ P = \int_{\mathcal{R}} p(x)\, dx $$

Given a data set of N observations drawn from p(x), the total number K of points found inside $\mathcal{R}$ is distributed according to the binomial distribution

$$ \mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^K (1-P)^{N-K} $$

Expectation of K: $\mathbb{E}[K/N] = P$
Variance of K: $\mathrm{var}[K/N] = P(1-P)/N$


Nonparametric Density Estimation – Refined

Expectation of K: $\mathbb{E}[K/N] = P$
Variance of K: $\mathrm{var}[K/N] = P(1-P)/N$
For large N, the distribution will be sharply peaked and therefore

$$ P \approx K/N $$

Assuming also that the region has volume V and is small enough for p(x) to be roughly constant over it,

$$ P \approx p(x)\, V $$

Combining these two contradictory assumptions
  Region $\mathcal{R}$ is small enough for p(x) to be roughly constant over it.
  Region $\mathcal{R}$ is large enough that enough points K fall into it to give a sharp peak for the binomial distribution.
gives

$$ p(x) \approx \frac{K}{NV} $$


Nonparametric Density Estimation – Refined

Two ways to exploit

$$ p(x) \approx \frac{K}{NV} $$

1 Fix K and determine the volume V from the data: K-nearest-neighbours density estimation
2 Fix V and determine K from the data: kernel density estimation


Nonparametric Estimation – Nearest Neighbour

Fix K and find an appropriate value for V.
Consider a small sphere around x and allow its radius to grow until the sphere contains exactly K data points.
Estimate the density by

$$ p(x) \approx \frac{K}{NV} $$

(Figure: nearest neighbour density model for K = 1, 5 and 30.)
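A minimal one-dimensional sketch of this K-nearest-neighbour estimate, assuming NumPy; the data and query grid are synthetic, and the "volume" is simply twice the distance to the K-th nearest training point.

import numpy as np

def knn_density(x_query, data, K):
    """1-D K-NN density estimate p(x) ~ K / (N * V), where V is the length of the
    smallest interval around each query point containing K training points."""
    data = np.asarray(data)
    x_query = np.atleast_1d(x_query)
    dists = np.sort(np.abs(data[:, None] - x_query[None, :]), axis=0)
    radius = dists[K - 1]                  # distance to the K-th nearest point
    volume = 2.0 * radius                  # interval length in one dimension
    return K / (len(data) * volume)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=200)      # synthetic training data
grid = np.linspace(-3.0, 3.0, 7)
for K in (1, 5, 30):                       # the values of K from the figure
    print(K, np.round(knn_density(grid, data, K), 3))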


Nonparametric Estimation – Parzen Estimator

Define the region $\mathcal{R}$ to be a small hypercube around x.
Define the Parzen window (kernel function)

$$ k(u) = \begin{cases} 1, & |u_i| \le 1/2, \quad i = 1, \dots, D \\ 0, & \text{otherwise} \end{cases} $$

The total number of data points inside the hypercube centred at x with side length h is

$$ K = \sum_{n=1}^{N} k\!\left( \frac{x - x_n}{h} \right) $$

Density estimate for p(x):

$$ p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left( \frac{x - x_n}{h} \right) $$

Interpret this as a sum over N cubes, one centred at each of the $x_n$.


Nonparametric Estimation – Parzen Estimator

This solves the histogram's problem of a fixed discretisation of the input space.
Remaining problem: discontinuities, because each point is either inside or outside the hypercube.
Solution: choose a smoother kernel function (and normalise it correctly).
Common choice: the Gaussian kernel

$$ k(x) = \mathcal{N}(x \mid 0, h^2 I) = \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x\|^2}{2h^2} \right\} $$

We can choose any other kernel function k(u) obeying

$$ k(u) \ge 0, \qquad \int k(u)\, du = 1 $$


Nonparametric Estimation – Parzen Estimator

Gaussian Kernel Density Estimate (KDE):

$$ p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\} $$

h controls the trade-off between sensitivity to noise and over-smoothing.

(Figure: kernel density model with Gaussian kernel for h = 0.005, 0.07 and 0.2.)
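A minimal sketch of this Gaussian KDE for one-dimensional data (D = 1), assuming NumPy; the two-component data set is synthetic, while the bandwidths follow the figure.

import numpy as np

def gaussian_kde(x_query, data, h):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2) for one-dimensional data."""
    data = np.asarray(data)
    x_query = np.atleast_1d(x_query)
    diff = x_query[:, None] - data[None, :]          # shape (queries, N)
    norm = 1.0 / np.sqrt(2.0 * np.pi * h ** 2)
    return np.mean(norm * np.exp(-diff ** 2 / (2.0 * h ** 2)), axis=1)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.3, 0.05, 25), rng.normal(0.7, 0.10, 25)])
grid = np.linspace(0.0, 1.0, 5)
for h in (0.005, 0.07, 0.2):                         # bandwidths from the figure
    print(h, np.round(gaussian_kde(grid, data, h), 2))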


Kernel Methods for Classification


The Role of Training Data

Parametric methods
  Learn the model parameter w from the training data t.
  Discard the training data t.

Nonparametric methods
  Use the training data directly for prediction.
  k-nearest neighbours: use the k closest points from the training set for classification.

Kernel methods
  Base the prediction on a linear combination of kernel functions evaluated at the training data.


Features

A feature is a measurable property of a phenomenon being observed, or any derived property thereof.
  Raw features: the original data.
  Derived features: mappings of the original features to some other space (possibly high- or infinite-dimensional, e.g. basis functions).

Feature selection: which features matter for the problem at hand?
  redundant features
  problem dependent

Feature extraction: can we combine the important features into a smaller set of new features?
  compact representation versus the ability to explain to a human


Very simple example – XOR

 x1   x2   y = x1 XOR x2
 -1   -1        1
 -1    1       -1
  1   -1       -1
  1    1        1

(Figure: the four points plotted in the (x1, x2) plane, coloured by target.)

not linearly separable (why?)
raw features: {(−1,−1), (−1, 1), (1,−1), (1, 1)}


Very simple example – XOR

 x1   x2   xnew = x1 · x2   y = x1 XOR x2
 -1   -1         1                1
 -1    1        -1               -1
  1   -1        -1               -1
  1    1         1                1

(Figure: the data plotted along the new feature xnew; the two classes are now separated by a threshold on xnew.)

feature extraction: xnew = x1 · x2
the data are now separable (see the sketch below)!
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
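A small sketch of the feature extraction above (Python/NumPy): in the raw ±1 encoding no line separates the two classes, but the single derived feature xnew = x1 · x2 already equals the target, so thresholding xnew at zero classifies perfectly.

import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([1, -1, -1, 1])              # targets from the table above

x_new = X[:, 0] * X[:, 1]                 # derived feature xnew = x1 * x2
print(np.all(np.sign(x_new) == y))        # True: separable by thresholding xnew at 0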


Kernel methods in one slide

Consider a labelled training set $\{x_i, t_i\}_{i=1}^N$.
On a new point x, we will predict

$$ y(x) = \sum_{i=1}^{N} a_i \, K(x, x_i) $$

where $\{a_i\}_{i=1}^N$ are weights to be determined based on our training set, and K(·, ·) is a kernel function.
This is a major departure from the linear models considered previously!
The kernel function measures the similarity between any two examples:
  the prediction is a weighted average of the training targets;
  the weights depend on the similarity of x to each training example.
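A minimal sketch of this prediction rule, assuming NumPy and an illustrative Gaussian kernel for K(·, ·); the weights a_i are taken as given here (how to determine them from the training set is developed in the following slides).

import numpy as np

def gaussian_kernel(x, x_prime, h=0.5):
    """Illustrative kernel: similarity decays with squared distance."""
    x, x_prime = np.asarray(x, float), np.asarray(x_prime, float)
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * h ** 2))

def predict(x, X_train, a, kernel=gaussian_kernel):
    """y(x) = sum_i a_i * K(x, x_i): a weighted combination of kernel
    evaluations of x against every training input."""
    return sum(a_i * kernel(x, x_i) for a_i, x_i in zip(a, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])     # two toy training inputs
a = np.array([1.0, -1.0])                        # weights assumed given
print(predict(np.array([0.1, 0.0]), X_train, a)) # dominated by the nearby training point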


From Feature Functions to Kernels (1)

Suppose we perform linear regression with a feature matrix Φ and target vector t, where

$$ \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_N)^\top \end{pmatrix} $$

Recall that the optimal (regularised) $w^\star$ is

$$ w^\star = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t $$

Thus the prediction for the feature vector of a new point x is

$$ y(x) = \phi(x)^\top w^\star = \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t $$


From Feature Functions to Kernels (2)

Prediction with the optimal (regularised) $w^\star$:

$$ y(x) = \phi(x)^\top w^\star = \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t $$

Suppose that M is very large. Then the inverse of the M × M matrix above will be expensive to compute.
Consider, however, the following trick:

$$ \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t = \phi(x)^\top \Phi^\top (\lambda I + \Phi \Phi^\top)^{-1} t $$

(this is a useful identity, worth verifying; a numerical check follows below)
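A quick numerical check of this identity, assuming NumPy: for a random design matrix, random targets, and λ > 0, the M × M and N × N formulations agree to machine precision. All names here are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 20, 50, 0.1
Phi = rng.normal(size=(N, M))             # design matrix: N data points, M features
t = rng.normal(size=N)                    # targets
phi_x = rng.normal(size=M)                # feature vector of a new point

# left-hand side: solve an M x M system
lhs = phi_x @ np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
# right-hand side: solve an N x N system
rhs = phi_x @ Phi.T @ np.linalg.solve(lam * np.eye(N) + Phi @ Phi.T, t)
print(np.allclose(lhs, rhs))              # True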


From Feature Functions to Kernels (3)

We have thus written the prediction as

$$ y(x) = \phi(x)^\top \Phi^\top (\lambda I + \Phi \Phi^\top)^{-1} t = \sum_{i=1}^{N} a_i \, K(x, x_i) $$

as before (check this by identifying the $a_i$ and K in the above).
Now our prediction is determined by an N × N rather than an M × M matrix.
$\Phi \Phi^\top$ is known as the kernel matrix of the training data.
  In some sense, it measures the similarities between the training instances.
  For example, for normalised vectors (the case for some implicit kernel mappings), the inner product between two points is a measure of similarity:

$$ \operatorname*{argmax}_{v : \|v\| = \|u\|} \langle u, v \rangle = u. $$

NB: the $\phi(x_m)^\top$ are never explicitly needed, only the inner products $\phi(x_m)^\top \phi(x_n) \equiv K(x_m, x_n)$. (!)


Dual Representations

Consider a linear regression model with regularised sum-of-squares error

$$ J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( w^\top \phi(x_n) - t_n \right)^2 + \frac{\lambda}{2} w^\top w $$

where λ ≥ 0.
We could also write this in the more compact form

$$ J(w) = \frac{1}{2} (t - \Phi w)^\top (t - \Phi w) + \frac{\lambda}{2} w^\top w $$

with the target vector $t = (t_1, \dots, t_N)^\top$ and the design matrix

$$ \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix}. $$


Dual Representations

Critical points of

$$ J(w) = \frac{1}{2} (t - \Phi w)^\top (t - \Phi w) + \frac{\lambda}{2} w^\top w $$

satisfy

$$ (\Phi^\top \Phi + \lambda I)\, w = \Phi^\top t \quad\Longleftrightarrow\quad \lambda w = \Phi^\top (t - \Phi w), $$

so that

$$ w = \Phi^\top a = \sum_{n=1}^{N} a_n\, \phi(x_n), $$

where $a = (a_1, \dots, a_N)^\top$ with components

$$ a_n = -\frac{1}{\lambda} \left\{ w^\top \phi(x_n) - t_n \right\}. $$


Dual Representations

Now express J(w) as a function of this new variable a instead of w, via the relation $w = \Phi^\top a$:

$$ J(a) = \frac{1}{2} a^\top \Phi \Phi^\top \Phi \Phi^\top a - a^\top \Phi \Phi^\top t + \frac{1}{2} t^\top t + \frac{\lambda}{2} a^\top \Phi \Phi^\top a, $$

where again $t = (t_1, \dots, t_N)^\top$.
This is known as the dual representation.
Define the N × N Gram matrix $K = \Phi \Phi^\top$ with elements

$$ K_{nm} = \phi(x_n)^\top \phi(x_m) = k(x_n, x_m). $$

Express J(a) now as

$$ J(a) = \frac{1}{2} a^\top K K a - a^\top K t + \frac{1}{2} t^\top t + \frac{\lambda}{2} a^\top K a. $$


Stationary Point of J(a)

Let's calculate the stationary point of

$$ J(a) = \frac{1}{2} a^\top K K a - a^\top K t + \frac{1}{2} t^\top t + \frac{\lambda}{2} a^\top K a. $$

The gradient condition is

$$ \nabla_a J(a) = 0 = K K a - K t + \lambda K a, $$

and therefore

$$ a^\star = (K + \lambda I_N)^{-1} t. $$

The Hessian is positive semidefinite,

$$ H_a J(a) = K K + \lambda K \succeq 0, $$

because $\xi^\top K \xi = \xi^\top \Phi \Phi^\top \xi = \|\Phi^\top \xi\|^2 \ge 0$ for all ξ, and similarly for the first term.
Hence $a^\star$ minimises J(a).


Prediction for the Linear Regression Model

Substituting the minimiser $a^\star$ of J(a) into the linear regression prediction model, we get

$$ y(x) = w^\top \phi(x) = a^{\star\top} \Phi\, \phi(x) = (\Phi\, \phi(x))^\top a^\star = k(x)^\top (K + \lambda I_N)^{-1} t, $$

where we defined the vector k(x) with elements $k_n(x) = k(x_n, x) = \phi(x_n)^\top \phi(x)$.
The prediction y(x) can be expressed entirely in terms of the kernel function k(x, x′) evaluated at the training and test data.
Looks familiar? See Bayesian Linear Regression.
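A minimal sketch of this dual prediction (kernel-ridge-regression style), assuming NumPy and an illustrative Gaussian kernel; the data, bandwidth and regularisation constant are arbitrary choices for the example.

import numpy as np

def kernel_matrix(X, Z, h=0.2):
    """Gram matrix K_nm = k(x_n, z_m) for an illustrative Gaussian kernel."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * h ** 2))

def fit(X, t, lam=0.01):
    """Dual weights a* = (K + lam * I)^(-1) t."""
    K = kernel_matrix(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def predict(X_new, X_train, a):
    """y(x) = k(x)^T a, with k_n(x) = k(x_n, x)."""
    return kernel_matrix(X_new, X_train) @ a

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(30, 1))
t = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)
a = fit(X, t)
print(np.round(predict(np.array([[0.25], [0.75]]), X, a), 2))   # roughly follows sin(2*pi*x)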


The Kernel Function

The kernel function is defined over two points, x and x′, of the input space:

$$ k(x, x') = \phi(x)^\top \phi(x'). $$

k(x, x′) is symmetric.
It is an inner product of two vectors of basis functions:

$$ k(x, x') = \langle \phi(x), \phi(x') \rangle. $$

For prediction, the kernel function will be evaluated at the training data points (see the next slides).


Dual Representation

What have we gained by the dual representation?
Need to invert an N × N matrix now, where N is the
number of data points. Can be large!

In the parameter space formulation, we “only” needed to
invert an M ×M matrix, where M was the number of basis
functions.
But, a kernel corresponds to an inner product of basis
functions. So we can use a large number of basis functions,
even infinitely many.

We can construct new valid kernels directly from given
ones (whatever the corresponding basis functions of the
new kernel might be).
As a kernel defines a kind of ’similarity’ between two points
in the input space, we can define kernels over graphs,
sets, strings, and text documents.


Kernel Construction


Kernels from Basis Functions

1 Choose a set of basis functions $\{\phi_1, \dots, \phi_M\}$.
2 Find a new kernel as an inner product between the vectors of basis functions evaluated at x and x′:

$$ k(x, x') = \phi(x)^\top \phi(x') = \sum_{i=1}^{M} \phi_i(x)\, \phi_i(x') $$


Kernels from Basis Functions

(Figure: polynomial basis functions (left) and the corresponding kernel k(x, x′) as a function of x for x′ = −0.5, marked by the red cross (right).)


Kernels from Basis Functions

(Figure: Gaussian basis functions (left) and the corresponding kernel k(x, x′) as a function of x for x′ = 0.0, marked by the red cross (right).)


Kernels from Basis Functions

(Figure: logistic sigmoid basis functions (left) and the corresponding kernel k(x, x′) as a function of x for x′ = 0.0, marked by the red cross (right).)


Kernels by Guessing a Kernel Function

1 Choose a mapping from two points of the input space to a real number which is symmetric in its arguments, e.g.

$$ k(x, z) = (x^\top z)^2 = k(z, x) $$

2 Try to write this as an inner product of a vector-valued function evaluated at the arguments x and z, e.g.

$$ k(x, z) = (x^\top z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)\, (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^\top = \phi(x)^\top \phi(z) $$

with the feature mapping $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^\top$.
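A quick check (NumPy) that this explicit feature map really reproduces the quadratic kernel on random two-dimensional inputs.

import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (x^T z)^2 in two dimensions."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), (x @ z) ** 2))   # True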


New Kernels From Theory

A necessary and sufficient condition for k(x, x′) to be a valid kernel is that the kernel matrix K, whose elements are $k(x_n, x_m)$, is positive semidefinite for all possible choices of the set $\{x_n\}$.
Previously we constructed $K = \Phi \Phi^\top$, which is automatically positive semidefinite (why?).
  If we can explicitly construct the kernel via basis functions, we are done.
  Even if we cannot find the basis functions easily, we may still be able to show that k(x, x′) is a valid kernel.


New Kernels From Other Kernels

Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following kernels are also valid:

$k(x, x') = c\, k_1(x, x')$, for a constant $c > 0$
$k(x, x') = f(x)\, k_1(x, x')\, f(x')$, for any function $f(\cdot)$
$k(x, x') = q(k_1(x, x'))$, for a polynomial $q(\cdot)$ with nonnegative coefficients
$k(x, x') = \exp(k_1(x, x'))$
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = k_3(\phi(x), \phi(x'))$, for any function $\phi(x)$ into $\mathbb{R}^M$ and any valid kernel $k_3(\cdot, \cdot)$ on $\mathbb{R}^M$
$k(x, x') = x^\top A x'$, for $A = A^\top$, $A \succeq 0$
$k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b)$, where $x = (x_a, x_b)$
$k(x, x') = k_a(x_a, x'_a)\, k_b(x_b, x'_b)$


New Kernels From Other Kernels

Further examples of kernels:

$k(x, x') = (x^\top x')^M$ : only terms of degree M
$k(x, x') = (x^\top x' + c)^M$ : all terms up to degree M
$k(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$ : Gaussian kernel
$k(x, x') = \tanh\left( a\, x^\top x' + b \right)$ : sigmoidal kernel (not actually a valid kernel)

Generally, we call
$k(x, x') = x^\top x'$ a linear kernel,
$k(x, x') = k(x - x')$ a stationary kernel,
$k(x, x') = k(\|x - x'\|)$ a homogeneous kernel.
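A small sketch (NumPy) illustrating the validity criterion from the previous slide on two of these kernels: on a random set of points, the Gram matrices of the Gaussian and polynomial kernels have no eigenvalues below numerical round-off, consistent with them being positive semidefinite. The point set and kernel parameters are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                      # 40 random points in R^3

def gram(kernel, X):
    """Kernel (Gram) matrix K_nm = k(x_n, x_m)."""
    return np.array([[kernel(x, z) for z in X] for x in X])

gaussian = lambda x, z, s=1.0: np.exp(-np.sum((x - z) ** 2) / (2.0 * s ** 2))
polynomial = lambda x, z, c=1.0, M=3: (x @ z + c) ** M

for name, k in [("gaussian", gaussian), ("polynomial", polynomial)]:
    eig = np.linalg.eigvalsh(gram(k, X))
    # no eigenvalue is negative beyond numerical round-off
    print(name, eig.min() >= -1e-8 * eig.max())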


Kernels over Graphs, Sets, Strings, Texts

We 'only' need an appropriate similarity measure k(x, x′) which is a kernel.
Example: given a set A, consider the set of all subsets of A, the power set P(A).
For two subsets $A_1, A_2 \in P(A)$, denote the number of elements in the intersection of $A_1$ and $A_2$ by $|A_1 \cap A_2|$.
Then it can be shown that

$$ k(A_1, A_2) = 2^{|A_1 \cap A_2|} $$

corresponds to an inner product in a feature space. Therefore $k(A_1, A_2)$ is a valid kernel function.
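A minimal sketch of this subset kernel; the feature-space check uses one coordinate per subset U of a small ground set A (with φ_U(S) = 1 if U ⊆ S), which is one way to see why 2^|A1 ∩ A2| is an inner product. The ground set and subsets here are illustrative.

import numpy as np
from itertools import combinations

def k_set(A1, A2):
    """Subset kernel k(A1, A2) = 2^|A1 intersect A2|."""
    return 2 ** len(set(A1) & set(A2))

def powerset(A):
    A = list(A)
    return [frozenset(c) for r in range(len(A) + 1) for c in combinations(A, r)]

A = {1, 2, 3, 4}                                     # small ground set
# phi(S) has one coordinate per subset U of A: 1 if U is a subset of S, else 0
phi = lambda S: np.array([float(U <= frozenset(S)) for U in powerset(A)])

A1, A2 = {1, 2, 3}, {2, 3, 4}
print(phi(A1) @ phi(A2) == k_set(A1, A2))            # True: both equal 2^|{2, 3}| = 4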


Kernels from Probabilistic Generative Models

Given p(x), we can define a kernel

$$ k(x, x') = p(x)\, p(x'), $$

which means two inputs x and x′ are similar if they both have high probability.
Including a weighting function p(i) extends the kernel to

$$ k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i). $$

For a continuous latent variable z,

$$ k(x, x') = \int p(x \mid z)\, p(x' \mid z)\, p(z)\, dz. $$

The same idea applies, for example, to a Hidden Markov Model over sequences of length L.


Kernels for Classification: Summary

Pick a suitable kernel function k(x, x′)
  e.g. by computing the inner product of some basis functions.
Make predictions by suitably combining $k(x, x_n)$ for each training example $x_n$
  implicitly, a linear model in some high-dimensional space.
For linear regression, we go from

$$ y(x) = \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t $$

to

$$ y(x) = k(x)^\top (K + \lambda I_N)^{-1} t, $$

where we can plug in a suitable kernel function to implicitly perform a nonlinear transformation.


Kernels for Classification: Summary

Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.

(Figure: classification with the linear kernel $k(x, x') = x^\top x'$.)


Kernels for Classification: Summary

Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.

(Figure: classification with the nonlinear kernel $k(x, x') = (x^\top x')^2$.)


Kernels for Classification: Summary

We need not explicitly work with φ!
