Machine Learning Kernel Methods
Department of Computer Science University College London
Lecture Overview
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
Lecture Overview
Lecture Overview
By the end of this lecture you should:
1 Know that Kernel Methods provide a mechanism for tractably accessing non-linear features for use with linear machine learning methods
2 Understand the relationship between feature maps and kernels via Mercer’s Theorem
3 Understand when it is sensible to apply kernel methods via a consideration of the Representer Theorem
The Kernel Trick
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
The Kernel Trick
Kernel methods seek to implicitly map our data into a higher-dimensional feature space so that linear machine learning algorithms can be used in this feature space
Linear methods are well understood and often efficient
An implicit mapping allows us to access high, even infinite, dimensional feature spaces efficiently
The Kernel Trick
Usually kernel based methods will follow two steps:
1 Map the data from the input space, X, to a higher-dimensional feature space, V, using some non-linear mapping, φ : X → V
2 Run a linear machine learning algorithm in this new space
This approach has wide applicability – many learning algorithms can be transformed into non-linear high dimensional methods, for example:
Ridge Regression Support Vector Machine PCA
The Kernel Trick
Input Data:
x = x1 ∈ X
The Kernel Trick
Feature Mapped Data: φ(x) = [x_1, x_1^2]^T ∈ V
The Kernel Trick
Input Data: x = [x_1, x_2]^T ∈ X
The Kernel Trick
Feature Mapped Data:
φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]^T ∈ V
The Kernel Trick
Considerations: Why
Are we justified in applying this strategy?
Why should a linear relationship exist in a higher dimensional space?
Cover’s Theorem gives some intuition
The Kernel Trick
Cover’s Theorem
Assume that our problem is that of binary classification
The theorem states that the probability that n data points in general position can be linearly separated in m dimensions is given by:
P(n, m) = 1, if n ≤ m + 1
P(n, m) = (1 / 2^{n−1}) Σ_{i=0}^{m} C(n−1, i), if n > m + 1
Thus the probability of n instances being linearly separable increases with growing dimensionality m
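As an illustration (not from the original slides), a minimal Python sketch of this formula, assuming the piecewise form given above; it prints P(n, m) for fixed n and growing m, showing the probability of separability rising towards 1:

from math import comb

def p_separable(n, m):
    # Probability that n points in general position in m dimensions
    # are linearly separable (Cover's counting argument above)
    if n <= m + 1:
        return 1.0
    return sum(comb(n - 1, i) for i in range(m + 1)) / 2 ** (n - 1)

for m in (1, 2, 5, 10, 20, 29):
    print(f"m = {m:2d}: P(30, m) = {p_separable(30, m):.4f}")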
The Kernel Trick
Considerations: How
How can we operate efficiently in a high dimensional space?
For example, consider matrix inversion in ridge regression for an infinite dimensional feature space
We would like to be able to enjoy the benefits of feature mapping without having to perform that mapping or to perform manipulations in elevated dimensions explicitly
The Kernel Trick lets us do exactly this
The Kernel Trick
Kernel Trick
Consider two input points, x^{(i)}, x^{(j)} ∈ X, and a feature mapping, φ : X → V
We can define an inner product on V, via the kernel function, κ, such that:
κ(x(i), x(j)) = φ(x(i)) · φ(x(j))
The Kernel Trick
Kernel Trick
Let us consider the description of our learning algorithm for a moment:
‘If an algorithm is described solely in terms of inner products, x(i) · x(j), in input space, then it can be lifted into feature space by replacing the occurrence of those inner products by κ(x(i),x(j))’ [Rasmussen & Williams (2006)]
But what kind of algorithm looks like that?
The Kernel Trick
Aside: Ridge Regression
Recall the optimisation problem associated with Ridge Regression:
min_w Σ_{i=1}^n (y^{(i)} − w · x^{(i)})^2 + λ∥w∥_2^2
This is equivalent to:
min_{w, ξ} Σ_{i=1}^n ξ^{(i)2} + λ∥w∥_2^2    (1)
subject to:
ξ^{(i)} = y^{(i)} − w · x^{(i)}, i = 1, ..., n
The Kernel Trick
Aside: Ridge Regression & Duality
From the Appendix we see this problem can be solved by seeking the solution to its dual optimisation problem
First we write the Lagrangian for problem (1):
L(ξ, w, α) = Σ_{i=1}^n ξ^{(i)2} + λ∥w∥_2^2 + Σ_{i=1}^n α^{(i)} (y^{(i)} − ξ^{(i)} − w · x^{(i)})
Here α = [α^{(1)}, ..., α^{(n)}]^T is the vector of Lagrange multipliers
The dual objective can be written:
D(α) = min_{w, ξ} L(ξ, w, α)
The Kernel Trick
Aside: Ridge Regression & Duality
This is an unconstrained optimisation which we can solve by seeking stationary points:
∂L/∂w = −Σ_{i=1}^n α^{(i)} x^{(i)} + 2λw = 0  ⟹  w* = (1/(2λ)) Σ_{i=1}^n α^{(i)} x^{(i)}
∂L/∂ξ^{(i)} = 2ξ^{(i)} − α^{(i)} = 0  ⟹  ξ^{(i)*} = α^{(i)} / 2
The Kernel Trick
Aside: Ridge Regression & Duality
Substituting these expressions back into D(α) yields:
D(α) = (1/4) Σ_{i=1}^n α^{(i)2} + Σ_{i=1}^n α^{(i)} y^{(i)} − (1/2) Σ_{i=1}^n α^{(i)2} − (1/(2λ)) Σ_{i,j=1}^n α^{(i)} α^{(j)} x^{(i)} · x^{(j)} + (1/(4λ)) Σ_{i,j=1}^n α^{(i)} α^{(j)} x^{(i)} · x^{(j)}
= −(1/4) Σ_{i=1}^n α^{(i)2} + Σ_{i=1}^n α^{(i)} y^{(i)} − (1/(4λ)) Σ_{i,j=1}^n α^{(i)} α^{(j)} x^{(i)} · x^{(j)}
Re-scaling the dual variables by the substitution α = 2λᾱ, and then relabelling ᾱ as α, this becomes (up to a positive factor of λ):
−λ Σ_{i=1}^n α^{(i)2} + 2 Σ_{i=1}^n α^{(i)} y^{(i)} − Σ_{i,j=1}^n α^{(i)} α^{(j)} x^{(i)} · x^{(j)}
The Kernel Trick
Aside: Ridge Regression & Dual Problem
This leads to the following dual problem:
max_α −λ∥α∥_2^2 + 2 α · y − α^T (XX^T) α
or: max_α −α^T (XX^T + λI) α + 2 α · y
The solution to which is:
α = (XX^T + λI)^{−1} y
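As a quick numerical sanity check (a sketch, not part of the slides, assuming the standard primal normal-equation solution w = (X^T X + λI)^{−1} X^T y derived later in the lecture), the dual solution recovers exactly the same weight vector via w = X^T α:

import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 20, 5, 0.5
X = rng.normal(size=(n, m))   # design matrix, rows are the x^(i)
y = rng.normal(size=n)

# Primal ridge solution and dual ridge solution
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha          # w = sum_i alpha^(i) x^(i)

print(np.allclose(w_primal, w_dual))  # True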
The Kernel Trick
Kernel Trick: Revisited
Again let us consider the description of our learning algorithm:
‘If an algorithm is described solely in terms of inner products, x(i) · x(j), in input space, then it can be lifted into feature space by replacing the occurrence of those inner products by κ(x(i),x(j))’ [Rasmussen & Williams (2006)]
For example, consider the dual problem of Ridge Regression:
max_α −α^T (XX^T + λI) α + 2 α · y
⟹ max_{{α^{(i)}}_{i=1}^n} −Σ_{i,j=1}^n α^{(i)} α^{(j)} x^{(i)} · x^{(j)} − λ Σ_{i=1}^n α^{(i)2} + 2 Σ_{i=1}^n α^{(i)} y^{(i)}
The Kernel Trick
Kernel Trick
This is the Kernel Trick:
It means that, for certain problems, we can enjoy the benefits of feature mapping without ever having to calculate φ(x) explicitly
Instead we can compute the mapping implicitly by calculating:
κ(x(i), x(j)) = φ(x(i)) · φ(x(j))
In particular, it turns out that for many machine learning algorithms, when their associated optimisation problem is re-cast in its dual form, the inner product arises naturally:
We shall investigate the characteristics of such problems later on
Mercer’s Theorem
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
Mercer’s Theorem
What is a Kernel?
We now know that we can exploit the kernel trick to replace the inner products in (certain) learning algorithms with a kernel function
And we would like to be able to work directly with such functions
But how do we know if a particular function, κ, really has a feature map, φ, associated with it such that,
κ(x^{(i)}, x^{(j)}) = φ(x^{(i)}) · φ(x^{(j)})?
And what characterises a valid function, κ?
Mercer’s Theorem
Mercer’s Theorem
Definition:
The Gram matrix, K, is the n × n matrix with elements {K_{ij} = κ(x^{(i)}, x^{(j)})}_{i,j=1}^n, where x^{(i)} ∈ R^m
Definition:
κ : R^m × R^m → R is positive semidefinite (psd) if it is symmetric and if the Gram matrix, K, is psd for all n and for all choices of {x^{(i)}}_{i=1}^n ⊂ R^m
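A small sketch (illustrative only, not from the slides) that checks these definitions empirically: build the Gram matrix for a candidate kernel on random points and confirm it is symmetric with non-negative eigenvalues:

import numpy as np

def gram(kappa, X):
    # Gram matrix K with K[i, j] = kappa(x^(i), x^(j)) for the rows of X
    n = X.shape[0]
    return np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-8):
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
poly = lambda a, b: (1.0 + a @ b) ** 2                   # polynomial kernel, p = 2
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)   # RBF kernel, sigma^2 = 1
print(is_psd(gram(poly, X)), is_psd(gram(rbf, X)))       # True True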
Mercer’s Theorem
Mercer’s Theorem
Mercer’s Theorem:
A kernel function, κ, is psd if and only if:
κ(x^{(i)}, x^{(j)}) = φ(x^{(i)}) · φ(x^{(j)}), where x^{(i)}, x^{(j)} ∈ R^m, for some feature map, φ : R^m → W (and some inner product space, W)
We call a kernel which satisfies this theorem a Mercer kernel
Mercer’s Theorem
Mercer’s Theorem: Aside
Let us prove one direction: if κ(x^{(i)}, x^{(j)}) = φ(x^{(i)}) · φ(x^{(j)}) for some feature map φ, then κ is psd.
For any c ∈ R^n, where c_i is the i-th element of c:
c^T K c = Σ_{i,j=1}^n c_i c_j κ(x^{(i)}, x^{(j)}) = Σ_{i,j=1}^n c_i c_j φ(x^{(i)}) · φ(x^{(j)})
= (Σ_{i=1}^n c_i φ(x^{(i)})) · (Σ_{j=1}^n c_j φ(x^{(j)})) = ∥Σ_{i=1}^n c_i φ(x^{(i)})∥^2 ≥ 0
Since this holds for every c ∈ R^n, every n and every choice of points, K is psd (symmetry follows from the symmetry of the inner product)
Mercer’s Theorem
Polynomial Kernel
The polynomial kernel can be used to compute a compact version of the polynomial feature mapping up to degree p:
κ(x^{(i)}, x^{(j)}) = (1 + x^{(i)} · x^{(j)})^p
Let's try expanding for p = 2 and x ∈ R^2:
κ(x^{(i)}, x^{(j)}) = (1 + x_1^{(i)} x_1^{(j)} + x_2^{(i)} x_2^{(j)})^2
= 1 + 2 x_1^{(i)} x_1^{(j)} + 2 x_2^{(i)} x_2^{(j)} + (x_1^{(i)} x_1^{(j)})^2 + (x_2^{(i)} x_2^{(j)})^2 + 2 x_1^{(i)} x_2^{(i)} x_1^{(j)} x_2^{(j)}
In other words our use of the kernel is equivalent to a mapping into a 6-dimensional feature space:
φ(x) = [1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2]^T
Mercer’s Theorem
Radial Basis Function Kernel
The RBF kernel is a flexible measure of similarity parameterised by a bandwidth hyperparameter, σ^2:
κ(x^{(i)}, x^{(j)}) = exp(−∥x^{(i)} − x^{(j)}∥^2 / (2σ^2))
The RBF kernel actually gives access to an infinite dimensional feature space.
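To see why (a hedged aside, shown here only for one-dimensional inputs), expand the cross term of the RBF kernel as a power series:

κ(x, x') = exp(−(x − x')^2 / (2σ^2)) = exp(−x^2/(2σ^2)) exp(−x'^2/(2σ^2)) Σ_{k=0}^∞ (x x')^k / (σ^{2k} k!)

so κ(x, x') = φ(x) · φ(x') with φ_k(x) = exp(−x^2/(2σ^2)) x^k / (σ^k √(k!)) for k = 0, 1, 2, ..., i.e. a feature map with infinitely many components.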
Mercer’s Theorem
Document Kernel
Kernels can be constructed for a variety of data types (graphs, categorical data, etc.)
The cosine similarity kernel can be used as a document kernel for the bag-of-words representation:
κ(x^{(i)}, x^{(j)}) = (x^{(i)} · x^{(j)}) / (∥x^{(i)}∥_2 ∥x^{(j)}∥_2)
Here the input attributes are all counts, and so the measure falls in the range [0, 1]
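A minimal sketch of this kernel on toy bag-of-words count vectors (the vocabulary and counts below are hypothetical):

import numpy as np

def cosine_kernel(a, b):
    # Cosine similarity; for non-negative count vectors this lies in [0, 1]
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vocabulary: ["kernel", "trick", "regression", "duality"]
doc1 = np.array([3.0, 1.0, 0.0, 0.0])
doc2 = np.array([2.0, 0.0, 1.0, 0.0])
doc3 = np.array([0.0, 0.0, 4.0, 2.0])
print(cosine_kernel(doc1, doc2), cosine_kernel(doc1, doc3))  # high overlap vs. 0.0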
Mercer’s Theorem
We now know that:
Mercer’s Theorem allows us to specify a feature map implicitly if we specify a Mercer kernel
And the Kernel Trick makes clear that if our learning problem can be written in terms of dot products of the input vectors then we can move to this kernel-defined feature space by simply replacing the dot products with the corresponding kernel mapped output
…But how do we know if a particular learning problem can be written in this way…
…For this we need to consider the Representer Theorem
The Representer Theorem
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
The Representer Theorem
Training Sample:
{(x^{(i)}, y^{(i)}) | x^{(i)} ∈ X ⊆ R^m, y^{(i)} ∈ R}_{i=1}^n
Feature Map:
φ : X → R^k
Weight Vector:
w ∈ R^k
Linear Prediction Function:
f(x) = w · φ(x)
Loss Function:
L : (y, f(x)) → L(y, f(x)) ∈ R
And some function:
Ω : w → Ω(w) ∈ R
The Representer Theorem
Representer Theorem: Sketch
For a regularised loss function, L_reg, defined such that:
L_reg(w) = Σ_{i=1}^n L(y^{(i)}, w · φ(x^{(i)})) + Ω(w)
If and only if Ω(w) is a non-decreasing function of ∥w∥_2 then, if w* minimises L_reg, it admits the following representation:
w* = Σ_{i=1}^n α_i φ(x^{(i)}), where α_i ∈ R
And therefore for a novel test point, z:
f(z) = w* · φ(z) = Σ_{i=1}^n α_i κ(x^{(i)}, z)
The Representer Theorem
Representer Theorem
This is a powerful result
For learning algorithms whose optimisation problem satisfies the conditions of the representer theorem…
…We need never directly seek w∗ (whose dimensions are possibly infinite)
Instead we need only seek the n parameters which characterise α
Furthermore, when evaluating a test point we need never explicitly calculate φ(z)
Instead we need only sum over a weighted set of n kernel function outputs, each evaluated between z and one of the training points {x^{(i)}}_{i=1}^n
The Representer Theorem
Representer Theorem: Sparsity
Note that a sparsity-inducing regulariser such as Ω(w) = ∥w∥_1, as in the LASSO, does not satisfy the conditions of the Representer Theorem
This explains why we cannot use the kernel trick in conjunction with this sort of feature selection
Sparsity comes at a price!
Example: Ridge Regression
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
Example: Ridge Regression
Ridge Regression: Primal Form
Recall our Ridge Regression optimisation problem, where we have elevated our input attributes to a k-dimensional feature space via the map, φ : R^m → R^k:
min_w Σ_{i=1}^n (y^{(i)} − w · φ(x^{(i)}))^2 + λ∥w∥_2^2
Here the so-called primal solution is generated by the revised normal equations:
w = (X^T X + λI)^{−1} X^T y    (2)
where X is the n × k feature-mapped design matrix,
X = [φ(x^{(1)}), φ(x^{(2)}), ..., φ(x^{(n)})]^T
And the associated prediction for a novel test point z is:
f(z) = w · φ(z) = y^T X (X^T X + λI)^{−1} φ(z)
Example: Ridge Regression
Ridge Regression: Dual Form
Left multiply equation (2) by (X^T X + λI):
X^T X w + λw = X^T y    (3)
⟹ w = λ^{−1} X^T (y − Xw)
Now define α = λ^{−1} (y − Xw), so that w = X^T α. Then:
λα = y − Xw = y − X X^T α    (substituting w = X^T α)
⟹ (XX^T + λI) α = y
α = (XX^T + λI)^{−1} y
This is the so-called dual solution
Example: Ridge Regression
Ridge Regression: Dual Form
And the associated prediction for a novel test point z is:
f(z) = w · φ(z)
= y^T (XX^T + λI)^{−1} X φ(z)
= Σ_{i=1}^n α_i φ(x^{(i)}) · φ(z)
In fact the same result follows from:
The Representer Theorem
The Lagrangian Dual solution…as we observed earlier
Example: Ridge Regression
Ridge Regression: Kernel Version
We may re-write this using kernels as:
α = (XX^T + λI)^{−1} y, with (XX^T)_{ij} = κ(x^{(i)}, x^{(j)})
f(z) = Σ_{i=1}^n α_i κ(x^{(i)}, z)
We see that φ(x) is never explicitly evaluated, either within the calculation of α_i or within the kernelised prediction function
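Putting the pieces together, a compact sketch of kernel ridge regression with an RBF kernel (the data, σ^2 and λ below are arbitrary illustrative choices, not from the slides):

import numpy as np

def rbf_kernel(A, B, sigma2=0.5):
    # Pairwise RBF kernel matrix between the rows of A and the rows of B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))                  # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)       # noisy targets

lam = 0.1
K = rbf_kernel(X, X)                                  # Gram matrix, K_ij = kappa(x^(i), x^(j))
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # dual solution

Z = np.linspace(-3, 3, 5).reshape(-1, 1)              # novel test points
print(np.round(rbf_kernel(Z, X) @ alpha, 2))          # f(z) = sum_i alpha_i kappa(x^(i), z)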
Example: Ridge Regression
Ridge Regression: The Value of the Kernel Trick
We note that the primal solution involves the inversion of the k × k matrix X^T X, an operation of complexity O(k^3)
…and prediction involves an operation of complexity O(k)
Meanwhile the dual solution involves an inversion of the Gram matrix, XXT , an operation of complexity O(n3)
…and prediction involves an operation of complexity O(n|κ|) (where |κ| is the number of operations involved in the calculation of an inner product)
Thus, for k ≫ n the value of the dual form and the kernel trick becomes apparent
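A rough sketch of this trade-off (sizes chosen arbitrarily for illustration): with n = 100 and k = 3000 the dual solve works with a 100 × 100 matrix rather than a 3000 × 3000 one, and both routes give the same predictor:

import numpy as np, time

n, k, lam = 100, 3000, 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(n, k))          # feature-mapped design matrix
y = rng.normal(size=n)

t0 = time.perf_counter()
w = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)    # primal: k x k system
t1 = time.perf_counter()
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)      # dual: n x n system
t2 = time.perf_counter()

print(f"primal {t1 - t0:.3f}s, dual {t2 - t1:.4f}s")
print(np.allclose(w, X.T @ alpha))   # same weight vector (up to numerical error)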
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
Summary
1 We can attempt to enhance the linear algorithms of machine learning by effecting a feature map of the input attributes in order to elevate them to a higher dimensional, non-linear space
2 This is a sensible and a tractable strategy if we can employ the Kernel Trick and work only via the dot product of feature mapped data points
3 The kernel trick can be applied precisely when a Representer Theorem holds for our optimisation problem
4 In this case, Mercer’s Theorem offers us a mechanism for defining a valid feature space via Mercer kernels, without the need to ever explicitly define the feature mapping
Lecture Overview
1 Lecture Overview
2 The Kernel Trick
3 Mercer’s Theorem
4 The Representer Theorem
5 Example: Ridge Regression
6 Summary
7 Appendix
Multiple Constraints: Problem
min_{x ∈ R^n} f(x)
subject to:
{g^{(i)}(x) ≤ 0}_{i=1}^m, {h^{(j)}(x) = 0}_{j=1}^p
Multiple Constraints: Lagrangian
We express the Lagrangian as:
L(x, λ, μ) = f(x) + Σ_{i=1}^m μ^{(i)} g^{(i)}(x) + Σ_{j=1}^p λ^{(j)} h^{(j)}(x)
Here λ = [λ^{(1)}, ..., λ^{(p)}]^T, with {λ^{(j)} ∈ R}_{j=1}^p, and μ = [μ^{(1)}, ..., μ^{(m)}]^T, with {μ^{(i)} ∈ R_{≥0}}_{i=1}^m, are the Lagrange multipliers
Multiple Constraints: Problem Reformulation
And we can solve our problem by seeking stationary solutions (x*, {μ^{(i)*}}, {λ^{(j)*}}) which satisfy the following:
∇_x L = 0
{g^{(i)}(x) ≤ 0}_{i=1}^m, {h^{(j)}(x) = 0}_{j=1}^p
subject to: {μ^{(i)} ≥ 0}_{i=1}^m
{μ^{(i)} g^{(i)}(x) = 0}_{i=1}^m
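As a small worked example (not from the slides): minimise f(x) = x^2 subject to the single constraint g(x) = 1 − x ≤ 0. Then:

L(x, μ) = x^2 + μ(1 − x), with μ ≥ 0
∇_x L = 2x − μ = 0 ⟹ x = μ/2
Complementary slackness: μ(1 − x) = 0. Taking μ = 0 gives x = 0, which violates 1 − x ≤ 0, so the constraint must be active: 1 − x = 0, giving x* = 1 and μ* = 2 ≥ 0.
All of the conditions hold, and indeed x* = 1 minimises x^2 over the feasible set x ≥ 1.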
Duality: Primal Problem
The original problem is sometimes known as the primal problem,
and its variables, x, are known as the primal variables
It is equivalent to the following formulation:
min_x [max_{λ, μ ≥ 0} L(x, λ, μ)]
Here the bracketed term is known as the primal objective function
Duality: Barrier Function
We can re-write the primal objective as follows:
max_{λ, μ ≥ 0} L(x, λ, μ) = f(x) + max_{λ, μ ≥ 0} [Σ_{i=1}^m μ^{(i)} g^{(i)}(x) + Σ_{j=1}^p λ^{(j)} h^{(j)}(x)]
Here the second term gives rise to a barrier function which enforces the constraints as follows:
max_{λ, μ ≥ 0} [Σ_{i=1}^m μ^{(i)} g^{(i)}(x) + Σ_{j=1}^p λ^{(j)} h^{(j)}(x)] = 0 if x is feasible, ∞ if x is infeasible
Duality: Minimax Inequality
In order to make use of this barrier function formulation, we will
need the minimax inequality:
max_y min_x φ(x, y) ≤ min_x max_y φ(x, y)
To see this, note first that for all x and y:
min_x φ(x, y) ≤ φ(x, y)
This is true for all y, therefore, in particular the following is true:
max_y min_x φ(x, y) ≤ max_y φ(x, y)   ∀x
This is true for all x, therefore, in particular the following is true:
max_y min_x φ(x, y) ≤ min_x max_y φ(x, y)
Duality: Weak Duality
We can now introduce the concept of weak duality:
min_x [max_{λ, μ ≥ 0} L(x, λ, μ)] ≥ max_{λ, μ ≥ 0} [min_x L(x, λ, μ)]
Here the bracketed term on the right hand side is known as the dual objective function, D(λ, μ)
If we can solve the right hand side of the inequality then we have a lower bound on the solution of our optimisation problem
Duality: Weak Duality
And often the right-hand side of the inequality is an easier problem to solve, because:
minx L(x, λ, μ) is an unconstrained optimisation problem for a given value of (λ, μ)…
…And if solving this problem is not hard then the overall problem is not hard to solve because:
max_{λ, μ ≥ 0} [min_x L(x, λ, μ)] is the maximisation of a pointwise minimum of functions that are affine in (λ, μ) – thus it is a concave maximisation problem, or equivalently a convex minimisation problem, and we know that such problems can be efficiently solved
Note that this is true even when f, g^{(i)}, h^{(j)} are nonconvex
Duality: Strong Duality
For certain classes of problems which satisfy constraint qualifications we can go further and strong duality holds:
min_x [max_{λ, μ ≥ 0} L(x, λ, μ)] = max_{λ, μ ≥ 0} [min_x L(x, λ, μ)]
There are several different constraint qualifications. One is Slater’s Condition, which holds for convex optimisation problems that have a strictly feasible point
Recall, these are problems for which f and the g^{(i)} are convex functions and the h^{(j)} are affine
For problems of this type we may seek to solve the dual optimisation problem:
max_{λ, μ ≥ 0} min_x L(x, λ, μ)
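Continuing the small worked example introduced earlier in the Appendix (minimising f(x) = x^2 subject to 1 − x ≤ 0; illustrative only), the dual objective is

D(μ) = min_x [x^2 + μ(1 − x)] = μ − μ^2/4  (the inner minimiser is x = μ/2)

Maximising over μ ≥ 0 gives μ* = 2 and D(μ*) = 1, which equals the primal optimum f(x*) = 1 at x* = 1; the problem is convex with a strictly feasible point (e.g. x = 2), so Slater’s condition holds and strong duality is confirmed.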
Duality: Strong Duality
Another reason for adopting the dual optimisation approach to solving constrained optimisation problems is based on dimensionality:
If the dimensionality of the dual variables, (m + p), is less than the dimensionality of the primal variables, n, then dual optimisation often offers a more efficient route to solutions
This is of particular importance if we are dealing with infinite dimensional primal variables