Statistical Machine Learning
© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XIII
Kernel Methods
Original Input versus Feature Space
Basic linear regression models act directly on the input x.
These models are trivially extended to work on a fixed
nonlinear transformation of the inputs using a vector of
basis functions φ(x).
Example: Use two Gaussian basis functions centered at
the green crosses in the input space.
[Figure: data points in the original input space (x1, x2) and their images in the feature space (φ1, φ2) defined by the two Gaussian basis functions.]
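To make the mapping concrete, here is a minimal numpy sketch of such a feature map; the two centres and the width s are illustrative assumptions, not values from the slides.

```python
import numpy as np

def gaussian_features(X, centres, s=0.5):
    """Map 2-D inputs X (shape N x 2) into feature space.

    phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for each centre c_j,
    so each input gets coordinates (phi_1(x), phi_2(x)).
    """
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)  # (N, M) squared distances
    return np.exp(-d2 / (2 * s ** 2))

# Two illustrative centres (the "green crosses") in the input space
centres = np.array([[-0.5, 0.0], [0.5, 0.0]])
X = np.random.default_rng(0).uniform(-1, 1, size=(5, 2))
Phi = gaussian_features(X, centres)   # shape (5, 2): feature-space coordinates
print(Phi)
```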
Original Input versus Feature Space
Linear decision boundaries in the feature space
correspond to nonlinear decision boundaries in the input
space.
Classes which are NOT linearly separable in the input
space can become linearly separable in the feature space.
BUT: If classes overlap in input space, they will also
overlap in feature space.
Nonlinear features φ(x) cannot remove the overlap.
[Figure: a nonlinear decision boundary in the input space (x1, x2) that corresponds to a linear decision boundary in the feature space (φ1, φ2).]
Where have we been?
Basis function models (regression, classification)
Flexible basis function models (neural networks)
Where are we going?
Why not use all training data to make predictions for the
test inputs?
Basic ideas:
Continuity: targets mostly do not change abruptly.
Similarity: each training pair (input, target) tells us
something about the possible targets in the neighbourhood
of the input.
Kernels formalise those ideas.
Nonparametric methods: do not rely on a fixed number of
parameters, but rather usually on storing the entire training
set (various loose definitions are used here).
Often leading to Shallow (cf. Deep) learning.
How are we going there?
0) Histograms
1) Simple density estimation kernels (need only be
non-negative and integrable)
2) Implicit feature mapping kernels (must be positive
definite)
Warning:
The term kernel is highly overloaded.
Even today we consider two different types of kernel:
1. Smoothing kernel / density estimation kernel / Parzen
window estimator kernel / Nadaraya-Watson kernel
2. Positive semidefinite kernel / reproducing kernel Hilbert
space kernel / implicit feature map kernel / support vector
machine kernel / ≈ Gaussian process covariance function
or kernel
Kernel Density Estimation
Density Estimation
Suppose we observe data points $\{x_n\}_{n=1}^N$
e.g. just N real numbers
Suppose we believe these are drawn independently from
some distribution p(x)
e.g. p(x) is Gaussian with unknown mean and variance
Density estimation problem: Estimate p(x) from data
Nonparametric Density Estimation – Histogram
Partition the space into bins of width ∆i.
Count the number ni of samples falling into each bin i.
Normalise.
$$p_i = \frac{n_i}{N \Delta_i}$$
[Figure: histograms of 50 data points generated from the distribution shown by the green curve, for common bin widths ∆ = 0.04, 0.08, 0.25.]
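A minimal numpy sketch of the histogram estimate $p_i = n_i/(N\Delta_i)$; the sample data, range and bin width are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)            # 50 illustrative 1-D samples

delta = 0.08                                  # common bin width Delta
bins = np.arange(0.0, 1.0 + delta, delta)     # bin edges covering [0, 1]
n_i, edges = np.histogram(x, bins=bins)       # counts n_i per bin

p_i = n_i / (len(x) * delta)                  # p_i = n_i / (N * Delta_i)

# np.histogram with density=True applies the same normalisation
assert np.allclose(p_i, np.histogram(x, bins=bins, density=True)[0])
```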
Nonparametric Density Estimation – Histogram
Advantages:
Data can be discarded after calculating the $p_i$ (so the
histogram is arguably not truly non-parametric, though it is
often described as such).
Algorithm can be applied to sequentially arriving data.
Disadvantages:
Dependency on bin width ∆i.
Discontinuities due to the bin edges.
Exponential scaling with the dimensionality D of the data:
with M bins per dimension, $M^D$ bins are needed in D dimensions.
Nonparametric Density Estimation – Refined
Draw data from some unknown probability distribution p(x)
in a D-dimensional space.
Consider a small region R containing x. The probability mass
associated with this region is
$$P = \int_R p(x)\, dx$$
Given a data set of N observations drawn from p(x), the total
number K of points found inside R is distributed according to the
binomial distribution
$$\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^K (1-P)^{N-K}$$
Expectation of the fraction K/N: $\mathbb{E}[K/N] = P$
Variance of the fraction K/N: $\mathrm{var}[K/N] = P(1-P)/N$
Nonparametric Density Estimation – Refined
Expectation of the fraction K/N: $\mathbb{E}[K/N] = P$
Variance of the fraction K/N: $\mathrm{var}[K/N] = P(1-P)/N$
For large N, the distribution will be sharply peaked around its
mean and therefore
$$P \approx K/N$$
Assuming also that the region R has volume V and is small enough
for p(x) to be roughly constant over it, then
$$P \approx p(x)\, V$$
Combining two contradictory assumptions:
Region R is small enough for p(x) to be roughly constant.
Region R is large enough that the number K of points falling
into it gives a sharply peaked binomial distribution.
$$p(x) \approx \frac{K}{NV}$$
Nonparametric Density Estimation – Refined
Two ways to exploit
$$p(x) \approx \frac{K}{NV}$$
1. Fix K and determine the volume V from the data:
K-nearest-neighbours density estimation
2. Fix V and determine K from the data:
kernel density estimation
Nonparametric Estimation – Nearest Neighbour
Fix K and find an appropriate value for V.
Consider a small sphere around x and then allow the
radius to increase until it contains exactly K data points.
Estimate the density by
$$p(x) \approx \frac{K}{NV}$$
[Figure: nearest-neighbour density model for K = 1, 5, 30.]
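A minimal 1-D numpy sketch of the K-nearest-neighbour estimate $p(x) \approx K/(NV)$; the training data are an illustrative assumption, and note that the resulting estimate is not guaranteed to integrate to one.

```python
import numpy as np

def knn_density_1d(x_query, X_train, K=5):
    """K-nearest-neighbour density estimate p(x) ~= K / (N V) in one dimension.

    V is the length of the smallest interval centred at x_query that
    contains exactly K training points (radius = distance to the K-th
    nearest neighbour).
    """
    N = len(X_train)
    dists = np.sort(np.abs(X_train - x_query))
    r = dists[K - 1]              # radius to the K-th nearest neighbour
    V = 2 * r                     # 1-D "volume" of that interval
    return K / (N * V)

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=50)
for K in (1, 5, 30):              # the values of K from the figure
    print(K, knn_density_1d(0.5, X_train, K=K))
```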
Nonparametric Estimation – Parzen Estimator
Define region R to be a small hypercube around x
Define Parzen window (kernel function)
$$k(u) = \begin{cases} 1, & |u_i| \le 1/2,\quad i = 1, \dots, D \\ 0, & \text{otherwise} \end{cases}$$
Total number of data points inside the hypercube of side length h
centred at x:
$$K = \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right)$$
Density estimate for p(x)
$$p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{x - x_n}{h}\right)$$
Interpret this as a sum over N cubes, one centred at each of the $x_n$.
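A minimal numpy sketch of the hypercube (Parzen window) estimator above; the data, dimension and bandwidth h are illustrative assumptions.

```python
import numpy as np

def parzen_window(u):
    """Hypercube kernel: 1 if all |u_i| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_density(x, X_train, h):
    """p(x) ~= (1/N) sum_n (1/h^D) k((x - x_n)/h)."""
    N, D = X_train.shape
    k_vals = parzen_window((x - X_train) / h)    # one 0/1 value per training point
    return k_vals.sum() / (N * h ** D)

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(100, 2))    # illustrative 2-D data
print(parzen_density(np.zeros(2), X_train, h=0.5))
```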
Nonparametric Estimation – Parzen Estimator
This solves the histogram's problem of a fixed discretisation of the
input space.
Remaining problem: discontinuities, because each point is either
inside or outside the hypercube.
Choose a smoother kernel function (and normalise
correctly).
Common choice: Gaussian kernel
$$k(x) = \mathcal{N}(x \mid 0, h^2 I) = \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x\|^2}{2h^2} \right\}$$
Can choose any other kernel function k(u) obeying
$$k(u) \ge 0, \qquad \int k(u)\, du = 1$$
Nonparametric Estimation – Parzen Estimator
Gaussian Kernel Density Estimate (KDE):
$$p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}$$
h controls the trade-off between sensitivity to noise and
over-smoothing.
[Figure: kernel density model with Gaussian kernel for h = 0.005, 0.07, 0.2.]
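A minimal 1-D numpy sketch of the Gaussian KDE above, evaluated for the three bandwidths shown in the figure; the sample data are an illustrative assumption.

```python
import numpy as np

def gaussian_kde_1d(x_grid, X_train, h):
    """p(x) = (1/N) sum_n (2 pi h^2)^(-1/2) exp(-(x - x_n)^2 / (2 h^2))."""
    diffs = x_grid[:, None] - X_train[None, :]                        # (grid, N)
    kernels = np.exp(-diffs**2 / (2 * h**2)) / np.sqrt(2 * np.pi * h**2)
    return kernels.mean(axis=1)                                       # average over the N kernels

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=50)
x_grid = np.linspace(0.0, 1.0, 200)
for h in (0.005, 0.07, 0.2):          # small h: noisy; large h: over-smoothed
    p = gaussian_kde_1d(x_grid, X_train, h)
    print(h, round(p.max(), 3))
```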
Kernel Methods for Classification
The Role of Training Data
Parametric methods
Learn the model parameter w from the training data t.
Discard the training data t.
Nonparametric methods
Use training data directly for prediction
k-nearest neighbours: use the k closest points from the ’training’
set for classification
Kernel methods
Base prediction on linear combination of kernel functions
evaluated at the training data.
Features
A feature is a measurable property of a phenomenon
being observed or any derived property thereof
raw features: the original data
derived features: mappings of the original features to some
other space (possibly high- or infinite-dimensional, e.g.,
basis functions)
Feature selection: which features matter for the problem at
hand?
redundant features
problem dependent
Feature extraction: can we combine the important features
into a smaller set of new features?
compact representation versus ability to explain to a human
Very simple example – XOR
x1   x2   y = x1 XOR x2
-1   -1    1
-1    1   -1
 1   -1   -1
 1    1    1
[Figure: the four XOR points plotted in the (x1, x2) plane.]
not linearly separable (why?)
raw features {(−1,−1), (−1, 1), (1,−1), (1, 1)}
Very simple example – XOR
x1   x2   xnew = x1 · x2   y = x1 XOR x2
-1   -1         1                 1
-1    1        -1                -1
 1   -1        -1                -1
 1    1         1                 1
[Figure: the data plotted against the new feature xnew, where the two classes are linearly separable.]
feature extraction: xnew = x1 · x2
data is now separable!
All classification algorithms also work if we first apply a
fixed nonlinear transformation of the inputs using a vector
of basis functions φ(x).
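A tiny numpy sketch of this feature-extraction step, using the encoding from the table above (the check itself is illustrative).

```python
import numpy as np

# XOR truth table with the +/-1 encoding used above
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([1, -1, -1, 1], dtype=float)      # targets y from the table

# Derived feature x_new = x1 * x2
x_new = X[:, 0] * X[:, 1]
print(x_new)                                   # [ 1. -1. -1.  1.]

# In the new 1-D feature space a trivial linear rule separates the classes
assert np.all(np.sign(x_new) == t)
```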
Kernel methods in one slide
Consider a labelled training set $\{x_i, t_i\}_{i=1}^N$
On a new point x, we will predict
$$y(x) = \sum_{i=1}^{N} a_i\, K(x, x_i)$$
where $\{a_i\}_{i=1}^N$ are weights to be determined based on our
training set, and K(·, ·) is a kernel function
This is a major departure from the linear models
considered previously!
The kernel function measures the similarity between any
two examples
Prediction is a weighted average of the training targets
Weights depend on the similarity of x to each training
example
From Feature Functions to Kernels (1)
Suppose we perform linear regression with a feature
matrix Φ and target vector t, where
$$\Phi = \begin{pmatrix}
\phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N)
\end{pmatrix}
= \begin{pmatrix}
\phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_N)^\top
\end{pmatrix}$$
Recall that the optimal (regularised) $w^\star$ is
$$w^\star = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$$
Thus, the prediction for the feature vector of a new point x is
$$y(x) = \phi(x)^\top w^\star = \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$$
From Feature Functions to Kernels (2)
Prediction with the optimal (regularised) $w^\star$:
$$y(x) = \phi(x)^\top w^\star = \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$$
Suppose that M is very large. Then, the inverse of an
M ×M matrix above will be expensive to compute.
Consider however the following trick:
$$\phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t = \phi(x)^\top \Phi^\top (\lambda I + \Phi \Phi^\top)^{-1} t$$
(this is a useful idea, worth verifying)
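A quick numerical check of this identity (a minimal sketch; the dimensions, random data and regulariser are arbitrary illustrative choices). Note that the identity matrices on the two sides have different sizes: M × M on the left, N × N on the right.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 8, 5, 0.1
Phi = rng.normal(size=(N, M))          # design matrix (N x M)
t = rng.normal(size=N)                 # targets
phi_x = rng.normal(size=M)             # feature vector of a new point x

# Left-hand side: invert an M x M matrix
lhs = phi_x @ np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
# Right-hand side: invert an N x N matrix instead
rhs = phi_x @ Phi.T @ np.linalg.solve(lam * np.eye(N) + Phi @ Phi.T, t)

assert np.isclose(lhs, rhs)            # both give the same prediction
```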
From Feature Functions to Kernels (3)
We have thus written the prediction as
$$y(x) = \phi(x)^\top \Phi^\top (\lambda I + \Phi\Phi^\top)^{-1} t = \sum_{i=1}^{N} a_i\, K(x, x_i)$$
as before (check by identifying the ai and K in the above).
Now, our prediction is determined by an N × N rather than
M ×M matrix
ΦΦ> is known as the kernel matrix of the training data
In some sense, measures the similarities between the
training instances
For example, for normalised vectors (the case for some
implicit kernel mappings), the inner product between two
points is a measure of similarity:
$$\operatorname*{argmax}_{v\,:\,\|v\| = \|u\|} \langle u, v \rangle = u$$
NB: the $\phi(x_m)$ are never explicitly needed, only the
inner products $\phi(x_m)^\top \phi(x_n) \equiv K(x_m, x_n)$. (!)
Dual Representations
Consider a linear regression model with regularised
sum-of-squares error
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( w^\top \phi(x_n) - t_n \right)^2 + \frac{\lambda}{2} w^\top w$$
where λ ≥ 0.
We could also write this in more compact form as
$$J(w) = \frac{1}{2} (t - \Phi w)^\top (t - \Phi w) + \frac{\lambda}{2} w^\top w$$
with the target vector $t = (t_1, \dots, t_N)^\top$, and the design
matrix
$$\Phi = \begin{pmatrix}
\phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N)
\end{pmatrix}.$$
Dual Representations
Critical points of
$$J(w) = \frac{1}{2} (t - \Phi w)^\top (t - \Phi w) + \frac{\lambda}{2} w^\top w$$
satisfy
$$(\Phi^\top \Phi + \lambda I)\, w = \Phi^\top t,
\qquad\text{equivalently}\qquad
\lambda w = \Phi^\top (t - \Phi w),$$
hence
$$w = \Phi^\top a = \sum_{n=1}^{N} \phi(x_n)\, a_n$$
where $a = (a_1, \dots, a_N)^\top$ with components
$$a_n = -\frac{1}{\lambda} \left\{ w^\top \phi(x_n) - t_n \right\}$$
Dual Representations
Now express J(w) as a function of this new variable a
instead of w via the relation $w = \Phi^\top a$:
$$J(a) = \frac{1}{2} a^\top \Phi\Phi^\top \Phi\Phi^\top a - a^\top \Phi\Phi^\top t + \frac{1}{2} t^\top t + \frac{\lambda}{2} a^\top \Phi\Phi^\top a$$
where again $t = (t_1, \dots, t_N)^\top$.
Known as the dual representation
Define the N × N Gram matrix $K = \Phi\Phi^\top$ with elements
$$K_{nm} = \phi(x_n)^\top \phi(x_m) = k(x_n, x_m).$$
Express J(a) now as
$$J(a) = \frac{1}{2} a^\top K K a - a^\top K t + \frac{1}{2} t^\top t + \frac{\lambda}{2} a^\top K a.$$
Stationary Point of J(a)
Let’s calculate the stationary point for
$$J(a) = \frac{1}{2} a^\top K K a - a^\top K t + \frac{1}{2} t^\top t + \frac{\lambda}{2} a^\top K a.$$
Gradient condition
$$\nabla_a J(a) = 0 = K K a - K t + \lambda K a$$
Therefore
$$a^\star = (K + \lambda I_N)^{-1} t.$$
Hessian is positive semi-definite
$$H_a J(a) = K K + \lambda K \succeq 0$$
because $\xi^\top K \xi = \xi^\top \Phi\Phi^\top \xi = \left\| \Phi^\top \xi \right\|^2 \ge 0$ for all $\xi$, and
similarly for the first term.
Hence $a^\star$ minimises J(a).
Prediction for the Linear Regression Model
Substituting the minimiser $a^\star$ of J(a) into the prediction
model for linear regression, we get
$$y(x) = w^\top \phi(x) = a^{\star\top} \Phi\, \phi(x) = (\Phi\, \phi(x))^\top a^\star = k(x)^\top (K + \lambda I_N)^{-1} t$$
where we defined the vector k(x) with elements
$k_n(x) = k(x_n, x) = \phi(x_n)^\top \phi(x)$.
The prediction y(x) can be expressed entirely in terms of
the kernel function k(x, x′) evaluated at the training and
test data.
Looks familiar? See Bayesian Linear Regression.
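The whole pipeline (kernel ridge regression) fits in a few lines of numpy; a minimal sketch, assuming a Gaussian kernel and synthetic 1-D data, both of which are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, s=0.3):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 s^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))                    # training inputs
t = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)     # noisy training targets
lam = 0.1

K = rbf_kernel(X, X)                                    # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)        # a* = (K + lam I_N)^{-1} t

X_test = np.linspace(-1, 1, 5)[:, None]
y_pred = rbf_kernel(X_test, X) @ a                      # y(x) = k(x)^T (K + lam I_N)^{-1} t
print(y_pred)
```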
The Kernel Function
The kernel function is defined over two points, x and x′, of
the input space
k(x, x′) = φ(x)>φ(x′).
k(x, x′) is symmetric.
It is an inner product of two vectors of basis functions
k(x, x′) = 〈φ(x),φ(x′)〉.
For prediction, the kernel function will be evaluated at the
training data points. (See next slides.)
Dual Representation
What have we gained by the dual representation?
Need to invert an N × N matrix now, where N is the
number of data points. Can be large!
In the parameter space formulation, we “only” needed to
invert an M ×M matrix, where M was the number of basis
functions.
But, a kernel corresponds to an inner product of basis
functions. So we can use a large number of basis functions,
even infinitely many.
We can construct new valid kernels directly from given
ones (whatever the corresponding basis functions of the
new kernel might be).
As a kernel defines a kind of ’similarity’ between two points
in the input space, we can define kernels over graphs,
sets, strings, and text documents.
Kernel Construction
Kernels from Basis Functions
1. Choose a set of basis functions $\{\phi_1, \dots, \phi_M\}$.
2. Find a new kernel as an inner product between vectors of
basis functions evaluated at x and x′:
$$k(x, x') = \phi(x)^\top \phi(x') = \sum_{i=1}^{M} \phi_i(x)\, \phi_i(x')$$
Kernels from Basis Functions
[Figure: polynomial basis functions (left) and the corresponding kernel k(x, x′) as a function of x for x′ = −0.5 (red cross, right).]
Kernels from Basis Functions
[Figure: Gaussian basis functions (left) and the corresponding kernel k(x, x′) as a function of x for x′ = 0.0 (red cross, right).]
Kernels from Basis Functions
[Figure: logistic sigmoid basis functions (left) and the corresponding kernel k(x, x′) as a function of x for x′ = 0.0 (red cross, right).]
Kernels by Guessing a Kernel Function
1. Choose a mapping from two points of the input space to a
real number, which is symmetric in its arguments, e.g.
$$k(x, z) = (x^\top z)^2 = k(z, x)$$
2. Try to write this as an inner product of a vector-valued
function evaluated at the arguments x and z, e.g.
$$\begin{aligned}
k(x, z) = (x^\top z)^2 &= (x_1 z_1 + x_2 z_2)^2 \\
&= x_1^2 z_1^2 + 2\, x_1 z_1\, x_2 z_2 + x_2^2 z_2^2 \\
&= (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)\,(z_1^2,\ \sqrt{2}\, z_1 z_2,\ z_2^2)^\top \\
&= \phi(x)^\top \phi(z)
\end{aligned}$$
with the feature mapping $\phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)^\top$.
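A one-assert numpy check of this construction (a minimal sketch; the random test points are illustrative).

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (x^T z)^2 with 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# The kernel value equals the inner product of the explicit features
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))
```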
New Kernels From Theory
A necessary and sufficient condition for k(x, x′) to be a
valid kernel is that the kernel matrix K, whose elements
are k(xn, xm), should be positive semidefinite for all
possible choices of the set {xn}.
Previously, we constructed K = ΦΦ>, which is
automatically positive semidefinite (why?)
If we can explicitly construct the kernel via basis functions,
we are good
Even if we cannot find the basis functions easily, we may be
able to deduce k(x, x′) is a valid kernel
New Kernels From Other Kernels
Given valid kernels k1(x, x′) and k2(x, x′), the following kernels
are also valid:
$k(x, x') = c\, k_1(x, x')$, for a constant $c > 0$
$k(x, x') = f(x)\, k_1(x, x')\, f(x')$, for any function $f(\cdot)$
$k(x, x') = q(k_1(x, x'))$, for a polynomial $q(\cdot)$ with non-negative coefficients
$k(x, x') = \exp(k_1(x, x'))$
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = k_3(\phi(x), \phi(x'))$, for any function $\phi(x)$ into $\mathbb{R}^M$ and any valid kernel $k_3(\cdot,\cdot)$ on $\mathbb{R}^M$
$k(x, x') = x^\top A x'$, for $A = A^\top$, $A \succeq 0$
$k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b)$, where $x = (x_a, x_b)$
$k(x, x') = k_a(x_a, x'_a)\, k_b(x_b, x'_b)$
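A small numpy sanity check (illustrative data, kernels and combination chosen arbitrarily) that a kernel built with these rules still has a positive semidefinite Gram matrix.

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def rbf_kernel(X, Y, s=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Scaling, sums, products and polynomials with non-negative coefficients
# of valid kernels are again valid kernels:
K = 2.0 * linear_kernel(X, X) + rbf_kernel(X, X) * linear_kernel(X, X) ** 2

# The resulting Gram matrix is (numerically) positive semidefinite
eigvals = np.linalg.eigvalsh((K + K.T) / 2)
print(eigvals.min() >= -1e-8)   # True
```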
New Kernels From Other Kernels
Further examples of kernels:
$k(x, x') = (x^\top x')^M$ (only terms of degree M)
$k(x, x') = (x^\top x' + c)^M$ (all terms up to degree M)
$k(x, x') = \exp\!\left( -\|x - x'\|^2 / 2\sigma^2 \right)$ (Gaussian kernel)
$k(x, x') = \tanh\!\left( a\, x^\top x' + b \right)$ (sigmoidal kernel, not valid in general)
Generally, we call
$k(x, x') = x^\top x'$ a linear kernel,
$k(x, x') = k(x - x')$ a stationary kernel,
$k(x, x') = k(\|x - x'\|)$ a homogeneous kernel.
Kernels over Graphs, Sets, Strings, Texts
We ’only’ need an appropriate similarity measure k(x, x′)
which is a kernel.
Example: Given a set A and the set of all subsets of A,
called the power set P(A).
For two subsets A1,A2 ∈ P(A), denote the number of
elements of the intersection of A1 and A2 by |A1 ∩ A2|.
Then it can be shown that
$$k(A_1, A_2) = 2^{|A_1 \cap A_2|}$$
corresponds to an inner product in a feature space.
Therefore, k(A1,A2) is a valid kernel function.
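A small Python check of this set kernel with an illustrative base set and subsets. The feature map used here, one indicator per subset U with $\phi_U(A) = 1$ if $U \subseteq A$, is one way to exhibit the inner product and is stated as an assumption rather than as the slides' own construction.

```python
from itertools import chain, combinations

def powerset(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(c)
            for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def subset_kernel(A1, A2):
    """k(A1, A2) = 2^{|A1 intersect A2|}."""
    return 2 ** len(A1 & A2)

base = {1, 2, 3, 4}                                   # illustrative base set A
A1, A2 = frozenset({1, 2, 3}), frozenset({2, 3, 4})

# Feature map: one indicator per subset U of the base set, phi_U(A) = [U subset of A].
# The inner product counts the subsets contained in both A1 and A2,
# i.e. the subsets of A1 ∩ A2, of which there are 2^{|A1 ∩ A2|}.
phi = lambda A: [1 if U <= A else 0 for U in powerset(base)]
inner = sum(a * b for a, b in zip(phi(A1), phi(A2)))

assert inner == subset_kernel(A1, A2)                 # both equal 2^2 = 4
```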
Kernels from Probabilistic Generative Models
Given p(x), we can define a kernel
k(x, x′) = p(x) p(x′),
which means two inputs x and x′ are similar if they both
have high probabilities.
Include a weighting function p(i) and extend the kernel to
$$k(x, x') = \sum_{i} p(x \mid i)\, p(x' \mid i)\, p(i).$$
For a continuous variable z
$$k(x, x') = \int p(x \mid z)\, p(x' \mid z)\, p(z)\, dz.$$
Hidden Markov Model with sequences of length L.
Kernels for Classification: Summary
Pick a suitable kernel function k(x, x′)
e.g. by computing inner product of some basis functions
Make predictions by suitably combining k(x, xn) for each
training example xn
implicitly, a linear model in some high-dimensional space
For linear regression, we go from
$$y(x) = \phi(x)^\top (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$$
to
$$y(x) = k(x)^\top (K + \lambda I_N)^{-1} t$$
can plug in suitable kernel function to implicitly perform
nonlinear transformation
Kernels for Classification: Summary
Working with a nonlinear kernel, we are implicitly
performing a nonlinear transformation of our data
Classification with the linear kernel $k(x, x') = x^\top x'$
Kernels for Classification: Summary
Working with a nonlinear kernel, we are implicitly
performing a nonlinear transformation of our data
Classification with the nonlinear kernel $k(x, x') = (x^\top x')^2$
Kernels for Classification: Summary
We need not explicitly work with φ!
Next: Mixture Models and EM 2 (EM for Gaussian Mixtures; relation to K-Means; mixture of Bernoulli distributions; latent variables; convergence of EM).