CS5487 Problem Set 10
Kernels
Antoni Chan
Department of Computer Science
City University of Hong Kong

Kernel functions

Problem 10.1 Constructing kernels from kernels
Suppose k1(x,z) and k2(x,z) are valid kernels. Prove the following kernels are also valid kernels:
(a) kernel scaling: k(x, z) = c k1(x, z), where c > 0.
(b) sum: k(x, z) = k1(x, z) + k2(x, z).
(c) product: k(x, z) = k1(x, z) k2(x, z).
(d) input scaling: k(x, z) = f(x) k1(x, z) f(z), for any function f(·).
(e) polynomial: k(x, z) = k1(x, z)^q, where q is a positive integer.
(f) exponential: k(x, z) = exp(k1(x, z)).
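As a quick numerical sanity check of these closure properties (not part of the problem), the sketch below builds Gram matrices for two base kernels on random data, applies each construction (a)-(f), and checks that the smallest eigenvalue of the result is non-negative up to round-off. The base kernels, data, and constants are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                             # 20 points in R^3

def gram(kfun, X):
    return np.array([[kfun(x, z) for z in X] for x in X])

k1 = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))    # RBF, a valid kernel
k2 = lambda x, z: (x @ z + 1.0) ** 2                     # polynomial, a valid kernel
f  = lambda x: np.sin(x[0])                              # arbitrary function for (d)

K1, K2 = gram(k1, X), gram(k2, X)
fx = np.array([f(x) for x in X])
constructions = {
    "(a) scaling":       3.0 * K1,
    "(b) sum":           K1 + K2,
    "(c) product":       K1 * K2,                        # elementwise (Schur) product
    "(d) input scaling": np.outer(fx, fx) * K1,
    "(e) power q=3":     K1 ** 3,
    "(f) exp":           np.exp(K1),
}
for name, K in constructions.items():
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())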
………
Problem 10.2 Kernels on real vectors
Let x, z ∈ R^n. Show the following are valid kernels,
(a) Gaussian or RBF: k(x, z) = exp(−α‖x − z‖²), for α > 0.
(b) exponential: k(x, z) = exp(−α‖x − z‖), for α > 0.
(c) linear: k(x, z) = x^T A z, where A is a positive definite matrix.
(d) polynomial: k(x, z) = (x^T z + 1)^q, where q is a positive integer.
(e) cosine: k(x, z) = x^T z / (‖x‖‖z‖).
(f) periodic: k(x, z) = exp{−α sin²((x − z)/2)}, for x, z ∈ R.
For (f), consider the warping u(x) = [sin(x), cos(x)]^T and the fact that ‖u(x) − u(z)‖² = 4 sin²((x − z)/2).
What are the feature transformations corresponding to each of these kernels?
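As an illustration for part (f) (not part of the problem), the snippet below numerically checks the warping identity from the hint; the test values are arbitrary.

import numpy as np

def u(x):
    # warping from the hint: u(x) = [sin(x), cos(x)]^T
    return np.array([np.sin(x), np.cos(x)])

rng = np.random.default_rng(1)
for x, z in rng.uniform(-np.pi, np.pi, size=(5, 2)):
    lhs = np.sum((u(x) - u(z)) ** 2)          # ||u(x) - u(z)||^2
    rhs = 4 * np.sin((x - z) / 2) ** 2
    print(f"x={x:+.3f} z={z:+.3f}  lhs={lhs:.6f}  rhs={rhs:.6f}")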
………
Problem 10.3 Constructing kernels from kernels (part 2 – feature selection)
Suppose x, z ∈ R^d, φ(x) is a function from x to R^M, and k3(·, ·) is a valid kernel in R^M. Prove the following is a valid kernel:
(a) k(x, z) = k3(φ(x), φ(z)).
Let xa and xb be variables with x = (xa, xb), and let ka and kb be valid kernels over their respective spaces. Show that the following are valid kernels:
(b) k(x, z) = ka(xa, za) + kb(xb, zb).
(c) k(x, z) = ka(xa, za) kb(xb, zb).
………
Problem 10.4 Constructing kernels from kernels (part 3 – normalization)
Some kernels are poorly scaled, in the sense that the dynamic range of their values is small relative to their absolute values. One solution to this problem is to "normalize" the kernel,
k̃(x, z) = k(x, z) / √(k(x, x) k(z, z)).   (10.1)
This will set the self-similarity value to k̃(x, x) = 1.
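For intuition (not part of the problem), a minimal NumPy sketch of the normalization in (10.1) applied to a full Gram matrix; the polynomial kernel and random data are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))

K = (X @ X.T + 1.0) ** 3          # Gram matrix of a polynomial kernel
d = np.sqrt(np.diag(K))           # sqrt of self-similarities k(x, x)
K_norm = K / np.outer(d, d)       # (10.1): k~(x,z) = k(x,z) / sqrt(k(x,x) k(z,z))

print(np.diag(K_norm))            # all ones
print(K_norm.min(), K_norm.max()) # values now lie in a bounded range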
(a) Show that the normalized kernel k̃ is a valid kernel.
(b) Let φ(x) be the feature transformation associated with the kernel k. Show that k̃ is equivalent to calculating the cosine between φ(x) and φ(z) in the high-dimensional feature space.
(c) What is the range of possible values of the normalized kernel k̃(x, z)?
………
Problem 10.5 Constructing kernels from kernels (part 4 – kernel distance)
If k(x, z) is a valid kernel, then it can be interpreted as a dot product. Hence, we can also construct a distance (norm) based on this dot product.
‖x − z‖² = x^T x − 2 x^T z + z^T z   ⟹   d²(x, z) = k(x, x) − 2 k(x, z) + k(z, z).   (10.2)
This squared distance can be used in place of the standard L2 norm.
(a) Show that the exponential kernel distance is a valid kernel:
k̂(x, z) = exp{−α (k(x, x) − 2 k(x, z) + k(z, z))}.   (10.3)
This is similar to a Gaussian kernel, but with the L2-norm replaced with the kernel distance.
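A small sketch (not part of the problem) of the kernel distance in (10.2) and the exponential kernel distance in (10.3), using an RBF base kernel and arbitrary test points as an example.

import numpy as np

def rbf(x, z, alpha=0.5):
    return np.exp(-alpha * np.sum((x - z) ** 2))

def kernel_dist2(k, x, z):
    # (10.2): d^2(x, z) = k(x,x) - 2 k(x,z) + k(z,z)
    return k(x, x) - 2 * k(x, z) + k(z, z)

def k_hat(k, x, z, alpha=1.0):
    # (10.3): exponential kernel distance
    return np.exp(-alpha * kernel_dist2(k, x, z))

x = np.array([1.0, 2.0]); z = np.array([0.5, -1.0])
print(kernel_dist2(rbf, x, z), k_hat(rbf, x, z))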
………
Problem 10.6 Kernels for histograms
Let x and z be d-dimensional histograms, i.e., x, z ∈ R^d, where each bin x_i, z_i ≥ 0. Show that the following kernels between histograms are valid kernels:
(a) correlation: k(x, z) = Σ_{i=1}^d x_i z_i.
(b) Bhattacharyya: k(x, z) = Σ_{i=1}^d √(x_i) √(z_i).
(c) histogram intersection: k(x, z) = Σ_{i=1}^d min(x_i, z_i).
(d) χ²-RBF: k(x, z) = exp(−α χ²(x, z)), where χ²(x, z) = Σ_{i=1}^d (x_i − z_i)² / (½(x_i + z_i)).
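A brief sketch (not part of the problem) computing Gram matrices for the histogram intersection and χ²-RBF kernels on random histograms; the data, α, and the small eps guard against empty bins are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
X = rng.random((8, 5))
X /= X.sum(axis=1, keepdims=True)        # 8 histograms with 5 bins each

def hist_intersection(x, z):
    return np.minimum(x, z).sum()

def chi2_rbf(x, z, alpha=1.0, eps=1e-12):
    chi2 = np.sum((x - z) ** 2 / (0.5 * (x + z) + eps))
    return np.exp(-alpha * chi2)

K_hi  = np.array([[hist_intersection(x, z) for z in X] for x in X])
K_chi = np.array([[chi2_rbf(x, z) for z in X] for x in X])
print(np.linalg.eigvalsh(K_hi).min(), np.linalg.eigvalsh(K_chi).min())  # both ~ >= 0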
………
Problem 10.7 Kernels on sets
Let X = {x_1, ···, x_M} and Z = {z_1, ···, z_P} be sets, where x, z are the set items (these could be real vectors or symbols). Define a kernel between items as k(x, z).
(a) Show that the kernel between sets X and Z,
k̃(X, Z) = Σ_{x∈X} Σ_{z∈Z} k(x, z),   (10.4)
is a valid kernel. Hint: consider the feature transformation Φ(X) = Σ_{i=1}^M φ(x_i), where φ(x_i) is the feature map induced by k(x, z).
(b) Show that the kernel
k̃(X, Z) = (1 / (|X||Z|)) Σ_{x∈X} Σ_{z∈Z} k(x, z)^q,   (10.5)
is a valid kernel, where q is a non-negative integer. This is called an exponent match kernel.
(c) Show that the kernel, which counts the number of common elements, is a valid kernel,
k̃(X, Z) = |X ∩ Z|.   (10.6)
(d) Show that the following kernel is also valid,
k̃(X, Z) = 2^|X∩Z|.   (10.7)
Another kernel on sets is the pyramid match kernel. There are also other kernels that are not positive definite, which have the form
k(X, Z) = e^{−α d(X,Z)²},   (10.8)
using the sum-of-minimum distances,
d(X, Z) = (1 / (|X| + |Z|)) (Σ_i min_j d(x_i, z_j) + Σ_i min_j d(z_i, x_j)),   (10.9)
or the max-of-minimum distances (Hausdorff distance),
d(X, Z) = max(max_i min_j d(x_i, z_j), max_i min_j d(z_i, x_j)).   (10.10)
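A short sketch (not part of the problem) of the summation set kernel in (10.4) and the exponent match kernel in (10.5), using an RBF item kernel and random sets as arbitrary choices.

import numpy as np

def k_item(x, z, alpha=0.5):
    return np.exp(-alpha * np.sum((x - z) ** 2))   # item-level RBF kernel

def k_set_sum(X, Z):
    # (10.4): sum of pairwise item kernels
    return sum(k_item(x, z) for x in X for z in Z)

def k_exponent_match(X, Z, q=2):
    # (10.5): normalized sum of item kernels raised to the power q
    return sum(k_item(x, z) ** q for x in X for z in Z) / (len(X) * len(Z))

rng = np.random.default_rng(4)
X = list(rng.normal(size=(4, 2)))   # set with 4 items
Z = list(rng.normal(size=(6, 2)))   # set with 6 items
print(k_set_sum(X, Z), k_exponent_match(X, Z))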
………
Problem 10.8 Kernels on distributions
Consider that the two inputs are distributions, p(x) and q(x). The correlation is a valid kernel between distributions,
k(p, q) = ∫ p(x) q(x) dx.   (10.11)
(a) Show that the kernel
k(p, q) = ∫ p(x)^ρ q(x)^ρ dx   (10.12)
is a valid kernel, for any ρ > 0. This is called the probability product kernel. A special case is the Bhattacharyya kernel (ρ = 0.5), and the correlation kernel (ρ = 1).
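For illustration (not part of the problem), the probability product kernel (10.12) can be approximated by numerical integration on a grid; here for two 1-D Gaussians, with the densities, ρ, and integration range being arbitrary choices.

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def prob_product_kernel(p, q, rho=0.5, lo=-10.0, hi=10.0, n=10001):
    # (10.12): k(p,q) = integral of p(x)^rho q(x)^rho dx,
    # approximated with the trapezoidal rule on [lo, hi]
    x = np.linspace(lo, hi, n)
    return np.trapz(p(x) ** rho * q(x) ** rho, x)

p = lambda x: gauss_pdf(x, 0.0, 1.0)
q = lambda x: gauss_pdf(x, 1.0, 2.0)
print(prob_product_kernel(p, q, rho=0.5))   # Bhattacharyya kernel
print(prob_product_kernel(p, q, rho=1.0))   # correlation kernel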
………
Problem 10.9 Kernels on sample spaces
Let X be the sample space of a random variable x, and A and B events, where
p(A) = ∫_{x∈A} p(x) dx.   (10.13)
Define the kernel
k : X × X → [−1, 1],   (10.14)
k(A, B) = p(A ∩ B) − p(A) p(B).   (10.15)
(a) Show that k(A, B) is a valid kernel. Hint: consider the feature transformation φ(A) = 1_A − p(A), where 1_A is the function that takes the value 1 on the set A, and 0 otherwise.
………
Kernel trick
Problem 10.10 Kernel SVM bias term
The bias term for the linear SVM can be calculated as
b* = (1 / |SV|) Σ_{i∈SV} (y_i − w^T x_i),   (10.16)
where SV = {i | α_i > 0} is the set of support vectors.
(a) Derive an expression for b* for the kernel SVM.
………
Problem 10.11 Kernel logistic regression
Consider the two-class regularized logistic regression from Problem 8.4, which is interpretable as MAP estimation. The prior distribution on w is zero-mean Gaussian with known precision matrix Γ (i.e., inverse of the covariance matrix),
p(w) = N(w | 0, Γ^{-1}).   (10.17)
Given the training set D = {X, y}, the MAP estimate is
w* = argmax_w log p(y|X, w) + log p(w),   (10.18)
which can be calculated using the Newton-Raphson method, with the iterations
w^(new) = (X R X^T + Γ)^{-1} X R z,   (10.19)
where R and z are calculated from the previous w^(old),
π_i = σ(x_i^T w^(old)),   π = [π_1, ···, π_n]^T,   (10.20)
R = diag(π_1(1 − π_1), ···, π_n(1 − π_n)),   (10.21)
z = X^T w^(old) − R^{-1}(π − y),   (10.22)
and σ(a) = 1/(1 + e^{−a}) is the logistic sigmoid function. In this problem, we will kernelize the MAP version of logistic regression to obtain kernel logistic regression.
(a) Define α* = w^T x* as the linear function of a novel input x*. Use the form in (10.19) to show that the linear function w^T x* can be kernelized when z is known,
α* = w^T x* = k*^T (K + R^{-1})^{-1} z,   (10.23)
where k* = [k(x*, x_1), ···, k(x*, x_n)]^T is the test kernel, and K = [k(x_i, x_j)]_ij is the training kernel matrix. Hence, the probability of class 1 for a new point x* is π* = σ(α*). Hint: use a matrix inverse identity (Problem 1.15), and note that x_i^T Γ^{-1} x_j is an inner product.
(b) Using (10.23), show that the z in (10.22) can be calculated iteratively from z^(old),
α^(old) = K(K + R^{-1})^{-1} z^(old),   (10.24)
z = α^(old) − R^{-1}(π − y),   (10.25)
where π_i = σ(α_i^(old)). Hence, the Newton-Raphson iterations can also be kernelized for training.
(c) What happened to the prior covariance Γ^{-1}? What is the relationship between the prior and the kernel function, e.g., when Γ = I?
(d) Is the scale of the kernel important? In other words, is using k(x_i, x_j) equivalent to using k̂(x_i, x_j) = γ k(x_i, x_j), for some γ > 0? Why or why not?
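A compact sketch (not part of the problem) of the kernelized iterations (10.23)-(10.25), using an RBF kernel and synthetic labels purely for illustration; the hyperparameters and the clipping of π_i(1 − π_i) are arbitrary numerical choices.

import numpy as np

def rbf_gram(A, B, alpha=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-alpha * d2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # synthetic 0/1 labels

K = rbf_gram(X, X)
a = np.zeros(len(X))                               # current alpha values (start at 0)
for _ in range(10):
    pi = sigmoid(a)
    Rinv = np.diag(1.0 / np.clip(pi * (1 - pi), 1e-6, None))   # R^{-1}
    z = a - Rinv @ (pi - y)                        # (10.25)
    a = K @ np.linalg.solve(K + Rinv, z)           # (10.24)

x_star = np.array([[0.3, -0.2]])
k_star = rbf_gram(X, x_star)[:, 0]
a_star = k_star @ np.linalg.solve(K + Rinv, z)     # (10.23)
print("P(class 1 | x*):", sigmoid(a_star))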
………
Problem 10.12 Gaussian process regression – nonlinear Bayesian regression
In this problem we will apply the "kernel trick" to Bayesian linear regression (Problem 3.10) to obtain a non-linear Bayesian regression algorithm, known as Gaussian process regression. The prior distribution on the linear weights is Gaussian, p(θ) = N(θ | 0, Γ). The posterior distribution of the parameters is Gaussian,
p(θ|D) = N(θ | μ̂_θ, Σ̂_θ),   (10.26)
μ̂_θ = (Γ^{-1} + Φ Σ^{-1} Φ^T)^{-1} Φ Σ^{-1} y,   (10.27)
Σ̂_θ = (Γ^{-1} + Φ Σ^{-1} Φ^T)^{-1},   (10.28)
where μ̂_θ is the posterior mean, Σ̂_θ is the posterior covariance, and Φ = [φ(x_1), ···, φ(x_n)] is the matrix of feature vectors. Given a novel input x*, the predictive distribution of the corresponding output f* = f(x*, θ) is also Gaussian,
p(f* | x*, D) = N(f* | μ̂*, σ̂*²),   (10.29)
μ̂* = φ(x*)^T μ̂_θ,   (10.30)
σ̂*² = φ(x*)^T Σ̂_θ φ(x*),   (10.31)
where the posterior mean μ̂_θ and covariance Σ̂_θ are given in (10.27) and (10.28). We will assume that the observation noise is i.i.d., i.e., Σ = σ² I.
(a) Show that the predictive mean μ̂* can be written in the form,
μ̂* = φ*^T Γ Φ (Φ^T Γ Φ + σ² I)^{-1} y,   (10.32)
where φ* = φ(x*). Hint: use a matrix inverse identity from Problem 1.15.
(b) Show that the predictive variance σ̂*² can be written in the form,
σ̂*² = φ*^T Γ φ* − φ*^T Γ Φ (Φ^T Γ Φ + σ² I)^{-1} Φ^T Γ φ*.   (10.33)
Hint: use a matrix inverse identity from Problem 1.15.
(c) The "kernel trick" can be applied to (a) and (b) by defining k(x_i, x_j) = φ(x_i)^T Γ φ(x_j), yielding
μ̂* = k*^T (K + σ² I)^{-1} y,   (10.34)
σ̂*² = k** − k*^T (K + σ² I)^{-1} k*,   (10.35)
where K = [k(x_i, x_j)]_ij is the kernel matrix, k* = [k(x*, x_i)]_i is the test kernel vector, and k** = k(x*, x*). What happened to the prior covariance Γ? What is the relationship between the prior covariance and the kernel function k(x_i, x_j)?
(d) Define z = (K + σ² I)^{-1} y. The predictive mean μ̂*, which is a function of x*, can then be written as
μ̂(x*) = Σ_{i=1}^n z_i k(x*, x_i),   (10.36)
where z_i is the i-th element of z. Hence, the regressed function is a linear combination of kernel functions k(x*, x_i), where x_i is fixed. What will be the general shape of the regressed function for the following kernels?
• linear: k(x_i, x_j) = α x_i^T x_j.
• polynomial: k(x_i, x_j) = α (x_i^T x_j + 1)².
• RBF: k(x_i, x_j) = α_1 e^{−α_2 ‖x_i − x_j‖²}.
• periodic: k(x_i, x_j) = α_1 exp{−α_2 sin²((x_i − x_j)/2)}.
α are the kernel parameters.
(e) σ̂*² is the variance of the predictive distribution of the function value f*, which corresponds to the uncertainty of the prediction (high variance means high uncertainty). Under what conditions will the variance be large? When will the variance be small?
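A minimal sketch (not part of the problem) of the kernelized predictive equations (10.34)-(10.35) with an RBF kernel on synthetic 1-D data; all data and hyperparameter values are arbitrary.

import numpy as np

def rbf(a, b, a1=1.0, a2=10.0):
    # k(x_i, x_j) = a1 * exp(-a2 * ||x_i - x_j||^2), for 1-D inputs
    return a1 * np.exp(-a2 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 15)                       # training inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=15)
sigma2 = 0.01                                   # observation noise variance

K = rbf(x, x)
x_star = np.linspace(0, 1, 5)
K_star = rbf(x_star, x)                         # rows are k_*^T for each test point
A = np.linalg.solve(K + sigma2 * np.eye(len(x)), np.eye(len(x)))   # (K + sigma^2 I)^{-1}

mu_star = K_star @ A @ y                                                        # (10.34)
var_star = rbf(x_star, x_star).diagonal() - np.einsum('ij,jk,ik->i', K_star, A, K_star)  # (10.35)
print(mu_star)
print(var_star)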
………
Problem 10.13 Kernel perceptron
For a training set D = {(x1,y1),…,(xn,yn)}, the Perceptron algorithm can be implemented as follows.
Algorithm 1 Perceptron algorithm
  set w = 0, b = 0, R = max_i ‖x_i‖
  repeat
    for i = 1, ..., n do
      if y_i(w^T x_i + b) ≤ 0 then
        set w ← w + η y_i x_i
        set b ← b + η y_i R²
      end if
    end for
  until there are no classification errors
The final classifier is y* = sign(w^T x* + b).
(a) Is the learning rate η relevant? Why?
(b) Show that w learned by the perceptron algorithm must take the form w = Σ_{i=1}^n α_i y_i x_i, where α_i ≥ 0, ∀i.
(c) Using (b), show that an equivalent Perceptron algorithm (the dual Perceptron) is:
Algorithm 2 Perceptron algorithm (dual)
  set α = 0, b = 0, R = max_i ‖x_i‖
  repeat
    for i = 1, ..., n do
      if y_i(Σ_{j=1}^n α_j y_j x_j^T x_i + b) ≤ 0 then
        set α_i ← α_i + 1
        set b ← b + y_i R²
      end if
    end for
  until there are no classification errors
(d) Can you give an interpretation to the parameters α_i? Which among the samples x_i are the hardest to classify?
(e) Apply the kernel trick to the dual perceptron algorithm to obtain the kernel perceptron algorithm. What is the kernelized decision function?
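A runnable sketch (not part of the problem) of the dual perceptron in Algorithm 2 on synthetic separable data; replacing the inner products x_j^T x_i with kernel evaluations is exactly the substitution asked for in part (e), so it is left out here.

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # separable labels in {-1, +1}

n = len(X)
G = X @ X.T                                    # Gram matrix of inner products x_j^T x_i
alpha = np.zeros(n)
b = 0.0
R = np.max(np.linalg.norm(X, axis=1))

converged = False
while not converged:
    converged = True
    for i in range(n):
        if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
            alpha[i] += 1.0
            b += y[i] * R ** 2
            converged = False

pred = np.sign(G @ (alpha * y) + b)
print("training errors:", int(np.sum(pred != y)))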
………
Problem 10.14 Kernel k-means
In this problem we will kernelize the k-means algorithm.
Consider the original k-means algorithm. Given a set of points X = {x1, · · · , xn}, where
x_i ∈ R^d, the goal is to assign a cluster label to each point, y_i ∈ {1, ···, K}, where K is the number of clusters. The k-means algorithm calculates the cluster centers μ_j using the current assignment to each cluster (K is assumed to be known). In each iteration, k-means performs the following two steps:
Cluster Assignment: z_ij = { 1, j = argmin_{k∈{1,···,K}} ‖x_i − μ_k‖²; 0, otherwise.   (10.37)
Estimate Center: μ_j = (Σ_{i=1}^n z_ij x_i) / (Σ_{i=1}^n z_ij).   (10.38)
The cluster label for point x_i is the label of the closest cluster center, y_i = argmax_j z_ij.
The disadvantage of k-means is that it only works when the clusters can be separated by a hyperplane. This is a consequence of using the Euclidean distance to measure the distance to the cluster center. To apply the kernel trick, we need to rewrite k-means so that it only uses inner-products between the data points x_i.
(a) Show that the squared distance from x to the cluster center μ_k can be written only using inner-products as
d(x, μ_k) = ‖x − μ_k‖² = x^T x − (2/N_k) Σ_{l=1}^n z_lk x^T x_l + (1/N_k²) Σ_{l=1}^n Σ_{m=1}^n z_lk z_mk x_l^T x_m,   (10.39)
where N_k = Σ_{l=1}^n z_lk is the number of points in cluster k.
(b) Apply the kernel trick to (10.39) to obtain the kernelized squared distance,
d(x, μ_k) = ‖x − μ_k‖² = k(x, x) − (2/N_k) Σ_{l=1}^n z_lk k(x, x_l) + (1/N_k²) Σ_{l=1}^n Σ_{m=1}^n z_lk z_mk k(x_l, x_m).   (10.40)
(c) What is the interpretation of the distance in (10.40) when k(x, x') is the Gaussian kernel? When will the distance be small?
(d) Show that the kernel k-means algorithm is:
Calculate Distances: d(x_i, μ_k) = k(x_i, x_i) − (2/N_k) Σ_{l=1}^n z_lk k(x_i, x_l) + (1/N_k²) Σ_{l=1}^n Σ_{m=1}^n z_lk z_mk k(x_l, x_m),  ∀i   (10.41)
Cluster Assignment: z_ij = { 1, j = argmin_{k∈{1,···,K}} d(x_i, μ_k); 0, otherwise.   (10.42)
A weighted version of kernel k-means has interesting connections to many spectral clustering and graph clustering methods (see [I.S. Dhillon, Y. Guan, B. Kulis, "Kernel k-means: spectral clustering and normalized cuts." Proc. ACM SIGKDD, 2004]).
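A compact sketch (not part of the problem) of kernel k-means using the distances in (10.41) and the assignment in (10.42); the RBF kernel, data, number of clusters, and iteration count are arbitrary choices, and it is assumed that no cluster becomes empty.

import numpy as np

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
n, n_clusters = len(X), 2

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-1.0 * d2)                                        # RBF kernel matrix

Z = np.eye(n_clusters)[rng.integers(0, n_clusters, n)]       # random one-hot assignments z_ij

for _ in range(10):
    Nk = Z.sum(axis=0)                                       # cluster sizes N_k (assumed > 0)
    # (10.41): d(x_i, mu_k) = k(x_i,x_i) - (2/N_k) sum_l z_lk k(x_i,x_l)
    #                         + (1/N_k^2) sum_l sum_m z_lk z_mk k(x_l,x_m)
    dist = (np.diag(K)[:, None]
            - 2 * (K @ Z) / Nk
            + np.einsum('lk,mk,lm->k', Z, Z, K) / Nk ** 2)
    Z = np.eye(n_clusters)[dist.argmin(axis=1)]              # (10.42)

print("cluster sizes:", Z.sum(axis=0))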
………
Problem 10.15 Kernel discriminant analysis
In this problem we will derive a kernel version of Fisher's linear discriminant (FLD, see Problem 7.6). Let X_1 = [x_1, ···, x_{n_1}] and X_2 = [x_1, ···, x_{n_2}] be the matrices of feature vectors from class 1 and class 2, and n_1 and n_2 the number of feature vectors for class 1 and class 2, respectively. The class mean and scatter matrix are given by
μ_j = (1/n_j) Σ_{i=1}^{n_j} x_i,   S_j = Σ_{i=1}^{n_j} (x_i − μ_j)(x_i − μ_j)^T.   (10.43)
FLD finds the optimal projection that maximizes the ratio of the "between-class" scatter and the "within-class" scatter,
w* = argmax_w J(w),   J(w) = (w^T S_B w) / (w^T S_W w),   (10.44)
where S_B and S_W are the between- and within-class scatter matrices,
S_B = (μ_1 − μ_2)(μ_1 − μ_2)^T,   S_W = S_1 + S_2.   (10.45)
The optimal projection is w* = S_W^{-1}(μ_1 − μ_2).
Let X = [X_1, X_2] be all the data. To do the kernelization, we will first assume that the optimal w can be written as a linear combination of the data points,
w = Σ_{i=1}^n α_i x_i = Xα,   (10.46)
where α = [α_1, ···, α_n] are the weights.
(a) Why is the assumption in (10.46) valid?
(b) Show that the class mean can be written as
μ_j = (1/n_j) X_j 1,   (10.47)
and the scatter matrix as
S_j = X_j (I − (1/n_j) 1 1^T) X_j^T,   (10.48)
where 1 is a vector of ones.
(c) Using (b), show that the mean and scatter matrices, when multiplied by w, can be written in terms of α,
w^T μ_j = α^T μ̂_j,   w^T S_j w = α^T Ŝ_j α,   (10.49)
where
μ̂_j = (1/n_j) X^T X_j 1,   Ŝ_j = X^T X_j (I − (1/n_j) 1 1^T) X_j^T X.   (10.50)
(d) Apply the kernel trick to μ̂_j and Ŝ_j to obtain the expressions,
μ̂_j = (1/n_j) K_j 1,   Ŝ_j = K_j (I − (1/n_j) 1 1^T) K_j^T,   (10.51)
where K_j is the kernel matrix between X and X_j, i.e.,
[K_j]_{i,i'} = k(x_i, x_{i'}),   x_i ∈ X, x_{i'} ∈ X_j.   (10.52)
(e) Show that the kernelized version of (10.44) is
α* = argmax_α J(α),   J(α) = (α^T Ŝ_B α) / (α^T Ŝ_W α),   (10.53)
where
Ŝ_B = (μ̂_2 − μ̂_1)(μ̂_2 − μ̂_1)^T,   Ŝ_W = Ŝ_1 + Ŝ_2.   (10.54)
(f) Show that the optimal α* is given by
α* = Ŝ_W^{-1}(μ̂_2 − μ̂_1).   (10.55)
(Hint: similar to the original problem, only the direction of α matters, and not the magnitude.)
(g) Finally, show that the kernelized projection of x is
z = Σ_{i=1}^n α_i k(x, x_i).   (10.56)
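A short sketch (not part of the problem) implementing the kernelized quantities (10.51)-(10.56) with an RBF kernel on synthetic two-class data; the kernel, data, and the small ridge added when inverting Ŝ_W for numerical stability are arbitrary choices, not part of the problem statement.

import numpy as np

def rbf_gram(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(9)
X1 = rng.normal(loc=[0, 0], scale=0.5, size=(25, 2))   # class 1
X2 = rng.normal(loc=[2, 2], scale=0.5, size=(25, 2))   # class 2
X = np.vstack([X1, X2])
n = len(X)

def class_stats(Xj):
    nj = len(Xj)
    Kj = rbf_gram(X, Xj)                               # kernel matrix between X and X_j
    mu_hat = Kj @ np.ones(nj) / nj                     # (10.51)
    C = np.eye(nj) - np.ones((nj, nj)) / nj            # centering matrix I - (1/n_j) 1 1^T
    S_hat = Kj @ C @ Kj.T                              # (10.51)
    return mu_hat, S_hat

mu1, S1 = class_stats(X1)
mu2, S2 = class_stats(X2)
SW = S1 + S2
alpha = np.linalg.solve(SW + 1e-6 * np.eye(n), mu2 - mu1)   # (10.55), with a small ridge

def project(x):
    # (10.56): z = sum_i alpha_i k(x, x_i)
    return rbf_gram(x[None, :], X)[0] @ alpha

print(project(X1[0]), project(X2[0]))                  # projection of one point per class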
………