CS5487 Problem Set 10
Antoni Chan
Department of Computer Science
City University of Hong Kong
Kernel functions
Problem 10.1 Constructing kernels from kernels
Suppose k1(x, z) and k2(x, z) are valid kernels. Prove the following kernels are also valid kernels:
(a) kernel scaling: k(x, z) = ck1(x, z), where c > 0.
(b) sum: k(x, z) = k1(x, z) + k2(x, z).
(c) product: k(x, z) = k1(x, z)k2(x, z).
(d) input scaling: k(x, z) = f(x)k1(x, z)f(z), for any function f(·).
(e) polynomial: k(x, z) = k1(x, z)^q, where q is a positive integer.
(f) exponential: k(x, z) = exp(k1(x, z)).
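These closure properties can also be sanity-checked numerically: a kernel is valid exactly when every Gram matrix it produces is positive semidefinite (PSD), so the smallest eigenvalue should be non-negative up to round-off. Below is a minimal Python/NumPy sketch; the base kernels k1, k2, the scaling function f, and the random data are arbitrary assumptions, not part of the problem.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                    # 20 random points in R^3 (assumed data)

# two base kernels assumed to be valid: an RBF kernel and a linear kernel
k1 = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
k2 = lambda x, z: x @ z

def gram(k, X):
    """Gram matrix K[i, j] = k(x_i, x_j)."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

def min_eig(K):
    """Smallest eigenvalue; >= -1e-10 means PSD up to round-off."""
    return np.linalg.eigvalsh(K).min()

K1, K2 = gram(k1, X), gram(k2, X)
f = X[:, 0] ** 2                                # an arbitrary function f(x)

print(min_eig(3.0 * K1))                        # (a) kernel scaling, c = 3
print(min_eig(K1 + K2))                         # (b) sum
print(min_eig(K1 * K2))                         # (c) elementwise (Schur) product
print(min_eig(np.outer(f, f) * K1))             # (d) input scaling f(x) k1(x,z) f(z)
print(min_eig(K1 ** 3))                         # (e) polynomial, q = 3
print(min_eig(np.exp(K1)))                      # (f) exponential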
. . . . . . . . .
Problem 10.2 Kernels on real vectors
Let x, z ∈ Rn. Show the following are valid kernels,
(a) Gaussian or RBF: k(x, z) = exp(−α ‖x− z‖2), for α > 0.
(b) exponential: k(x, z) = exp(−α ‖x− z‖), for α > 0.
(c) linear: k(x, z) = xTAz, where A is a positive definite matrix.
(d) polynomial: k(x, z) = (xT z + 1)q, where q is a positive integer.
(e) cosine: k(x, z) = x^T z / (‖x‖ ‖z‖).
(f) periodic: k(x, z) = exp{−α sin²((x − z)/2)}, x, z ∈ R.
For (f), consider the warping u(x) = [sin(x), cos(x)]^T and the fact that ‖u(x) − u(z)‖² = 4 sin²((x − z)/2).
What are the feature transformations corresponding to each of these kernels?
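As a numerical aside for part (f) (not part of the required proof), the warping identity can be checked directly; the value of α and the test points below are arbitrary assumptions.

import numpy as np

alpha = 2.0                                       # arbitrary kernel parameter (assumption)
u = lambda x: np.array([np.sin(x), np.cos(x)])    # the warping u(x)

x, z = 0.7, -1.3                                  # arbitrary test points
lhs = np.sum((u(x) - u(z)) ** 2)                  # ||u(x) - u(z)||^2
rhs = 4 * np.sin((x - z) / 2) ** 2                # 4 sin^2((x - z)/2)
print(np.isclose(lhs, rhs))                       # True

# hence the periodic kernel is a Gaussian kernel applied after the warping
k_periodic = np.exp(-alpha * np.sin((x - z) / 2) ** 2)
k_gauss_warped = np.exp(-(alpha / 4) * np.sum((u(x) - u(z)) ** 2))
print(np.isclose(k_periodic, k_gauss_warped))     # True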
. . . . . . . . .
Problem 10.3 Constructing kernels from kernels (part 2 – feature selection)
Suppose x, z ∈ Rd, φ(x) is a function from Rd to RM, and k3(·, ·) is a valid kernel in RM. Prove the
following is a valid kernel:
(a) k(x, z) = k3(φ(x), φ(z)).
Let xa and xb be variables with x = (xa, xb), and let ka and kb be valid kernels over their respective
spaces. Show that the following are valid kernels:
(b) k(x, z) = ka(xa, za) + kb(xb, zb).
(c) k(x, z) = ka(xa, za)kb(xb, zb).
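As an illustration of (c) only: the product of two kernels corresponds to the Kronecker (outer) product of their feature maps, which can be verified numerically. The explicit feature maps phi_a, phi_b and the random data below are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(1)
Xa = rng.normal(size=(10, 2))                # the xa parts (assumed data)
Xb = rng.normal(size=(10, 3))                # the xb parts (assumed data)

# assumed explicit feature maps: identity for xa, squared features for xb
phi_a = lambda xa: xa
phi_b = lambda xb: xb ** 2
ka = lambda u, v: phi_a(u) @ phi_a(v)
kb = lambda u, v: phi_b(u) @ phi_b(v)

# product kernel ka*kb <-> Kronecker product of the two feature vectors
phi_prod = lambda xa, xb: np.kron(phi_a(xa), phi_b(xb))

i, j = 2, 7
lhs = ka(Xa[i], Xa[j]) * kb(Xb[i], Xb[j])
rhs = phi_prod(Xa[i], Xb[i]) @ phi_prod(Xa[j], Xb[j])
print(np.isclose(lhs, rhs))                  # True: the product kernel is an inner product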
. . . . . . . . .
Problem 10.4 Constructing kernels from kernels (part 3 – normalization)
Some kernels are poorly scaled, in the sense that the dynamic range of their values is much smaller than their absolute values. One solution to this problem is to “normalize” the kernel,
k̃(x, z) = k(x, z) / √(k(x, x) k(z, z)).   (10.1)
This sets the self-similarity value to k̃(x, x) = 1.
(a) Show that the normalized kernel k̃ is a valid kernel.
(b) Let Φ(x) be the feature transformation associated with the kernel k. Show that k̃ is equivalent
to calculating the cosine between Φ(x) and Φ(z) in the high-dimensional feature space.
(c) What is the range of possible values of the normalized kernel k̃(x, z)?
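A minimal sketch of the normalization in (10.1) applied to a Gram matrix; the polynomial base kernel and the random data are assumptions.

import numpy as np

def normalize_gram(K):
    """Normalized kernel (10.1): Kt[i,j] = K[i,j] / sqrt(K[i,i] * K[j,j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
K = (X @ X.T + 1.0) ** 3                     # assumed base kernel: cubic polynomial

Kt = normalize_gram(K)
print(np.allclose(np.diag(Kt), 1.0))         # self-similarity is 1
print(np.linalg.eigvalsh(Kt).min())          # still PSD up to round-off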
. . . . . . . . .
Problem 10.5 Constructing kernels from kernels (part 4 – kernel distance)
If k(x, z) is a valid kernel, then it can be interpreted as a dot product. Hence, we can also construct
a distance (norm) based on this dot product.
‖x − z‖² = x^T x − 2x^T z + z^T z   ⇒   d²(x, z) = k(x, x) − 2k(x, z) + k(z, z).   (10.2)
This squared distance can be used in place of the standard squared Euclidean (L2) distance.
(a) Show that the exponential kernel distance is a valid kernel:
k̂(x, z) = exp{−α(k(x, x)− 2k(x, z) + k(z, z))}. (10.3)
This is similar to a Gaussian kernel, but with the L2-norm replaced with the kernel distance.
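A quick numerical illustration of (10.3): build the exponential kernel-distance kernel from an assumed base kernel's Gram matrix and check that it stays PSD. The min (histogram-intersection style) base kernel, the value of α, and the data are assumptions.

import numpy as np

def kernel_distance_rbf(K, alpha):
    """k_hat[i,j] = exp(-alpha * (K[i,i] - 2 K[i,j] + K[j,j])), as in (10.3)."""
    d2 = np.diag(K)[:, None] - 2.0 * K + np.diag(K)[None, :]   # squared kernel distance
    return np.exp(-alpha * d2)

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(10, 4)))                           # non-negative vectors
K = np.array([[np.minimum(xi, xj).sum() for xj in X] for xi in X])   # assumed base kernel

Khat = kernel_distance_rbf(K, alpha=0.5)
print(np.linalg.eigvalsh(Khat).min())                          # PSD up to round-off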
. . . . . . . . .
Problem 10.6 Kernels for histograms
Let x and z be d-dimensional histograms, i.e., x, z ∈ Rd, where each bin xi, zi ≥ 0. Show that the
following kernels between histograms are valid kernels:
(a) correlation: k(x, z) = Σ_{i=1}^d xi zi.
(b) Bhattacharyya: k(x, z) = Σ_{i=1}^d √(xi zi).
(c) histogram intersection: k(x, z) = Σ_{i=1}^d min(xi, zi).
(d) χ²-RBF: k(x, z) = exp{−α χ²(x, z)}, where χ²(x, z) = Σ_{i=1}^d (xi − zi)² / (xi + zi).
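Minimal implementations of the four histogram kernels, mainly to fix notation; the chi-squared distance uses the elementwise form written above, and the small eps guard plus the random histograms are assumptions.

import numpy as np

def correlation(x, z):
    return np.sum(x * z)

def bhattacharyya(x, z):
    return np.sum(np.sqrt(x * z))

def hist_intersection(x, z):
    return np.sum(np.minimum(x, z))

def chi2_rbf(x, z, alpha=1.0, eps=1e-12):
    chi2 = np.sum((x - z) ** 2 / (x + z + eps))   # chi-squared distance between histograms
    return np.exp(-alpha * chi2)

rng = np.random.default_rng(0)
x = rng.random(5); x /= x.sum()                   # two assumed normalized histograms
z = rng.random(5); z /= z.sum()
print(correlation(x, z), bhattacharyya(x, z), hist_intersection(x, z), chi2_rbf(x, z))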
. . . . . . . . .
Problem 10.7 Kernels on sets
Let X = {x1, · · · , xM} and Z = {z1, · · · , zP } be sets, where x, z are the set items (these could be
real vectors or symbols). Define a kernel between items as k(x, z).
(a) Show that the kernel between sets X and Z,
k̃(X, Z) = Σ_{i=1}^M Σ_{j=1}^P k(xi, zj),   (10.4)
is a valid kernel. Hint: consider the feature transformation Φ(X) = Σ_{i=1}^M φ(xi), where φ(xi) is
the feature map induced by k(x, z).
(b) Show that the kernel
k̃(X, Z) = Σ_{i=1}^M Σ_{j=1}^P k(xi, zj)^q,   (10.5)
is a valid kernel, where q is a non-negative integer. This is called an exponent match kernel.
(c) Show that the kernel, which counts the number of common elements, is a valid kernel,
k̃(X,Z) = |X ∩ Z|. (10.6)
(d) Show that the following kernel is also valid,
k̃(X,Z) = 2|X∩Z|. (10.7)
Another kernel on sets is the pyramid match kernel. There are also other kernels that are not
positive definite, which have the form
k(X, Z) = e^{−α d(X,Z)},   (10.8)
using the sum-of-minimum distances,
d(X, Z) = Σ_{i=1}^M min_j d(xi, zj) + Σ_{j=1}^P min_i d(zj, xi),   (10.9)
or the max-of-minimum distances (Hausdorff distance),
d(X, Z) = max( max_i min_j d(xi, zj), max_j min_i d(zj, xi) ).   (10.10)
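Minimal sketches of the summation kernel (10.4), the exponent match kernel (10.5), and the counting kernel (10.6); the RBF item kernel and the toy sets are assumptions.

import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def exponent_match_kernel(X, Z, item_kernel=rbf, q=1):
    """Sum of item-kernel values raised to the power q; q = 1 gives the summation kernel (10.4)."""
    return sum(item_kernel(x, z) ** q for x in X for z in Z)

def intersection_count_kernel(X, Z):
    """Counting kernel (10.6) for sets of hashable symbols."""
    return len(set(X) & set(Z))

rng = np.random.default_rng(0)
A = list(rng.normal(size=(3, 2)))                 # assumed set of 3 items in R^2
B = list(rng.normal(size=(5, 2)))                 # assumed set of 5 items in R^2
print(exponent_match_kernel(A, B), exponent_match_kernel(A, B, q=2))
print(intersection_count_kernel("abcd", "bdef"))  # common symbols b, d -> 2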
. . . . . . . . .
Problem 10.8 Kernels on distributions
Consider that the two inputs are distributions, p(x) and q(x). The correlation is a valid kernel between distributions,
k(p, q) = ∫ p(x) q(x) dx.   (10.11)
(a) Show that the kernel
k(p, q) = ∫ p(x)^ρ q(x)^ρ dx   (10.12)
is a valid kernel, for any ρ > 0. This is called the probability product kernel. Special cases are
the Bhattacharyya kernel (ρ = 0.5) and the correlation kernel (ρ = 1).
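A numerical illustration of (10.12) by approximating the integral on a grid; the two Gaussian densities and the grid are assumptions.

import numpy as np

def prob_product_kernel(p, q, xs, rho=0.5):
    """Grid approximation of the probability product kernel: integral of p(x)^rho q(x)^rho dx."""
    dx = xs[1] - xs[0]
    return np.sum(p(xs) ** rho * q(xs) ** rho) * dx

def gauss(mu, s):
    return lambda x: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

xs = np.linspace(-10, 10, 2001)
p, q = gauss(0.0, 1.0), gauss(1.5, 0.8)           # two assumed 1-D densities
print(prob_product_kernel(p, q, xs, rho=0.5))     # Bhattacharyya kernel
print(prob_product_kernel(p, q, xs, rho=1.0))     # correlation kernel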
. . . . . . . . .
Problem 10.9 Kernels on sample spaces
Let X be the sample space of a random variable x, and let A and B be events, with probability
p(A) = ∫_A p(x) dx.   (10.13)
Define the kernel
k : X × X → [−1, 1] (10.14)
k(A,B) = p(A ∩B)− p(A)p(B) (10.15)
(a) Show that k(A,B) is a valid kernel. Hint: consider the feature transformation Φ(A) = 1A −
p(A), where 1A is the function that takes the value 1 on the set A, and 0 otherwise.
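A small numerical check of (10.15) on a finite sample space: the Gram matrix over a handful of events should be PSD, since k(A, B) is the covariance of the centered indicator functions. The probabilities and random events below are assumptions.

import numpy as np

rng = np.random.default_rng(0)
p = rng.random(12); p /= p.sum()                  # assumed p(x) over 12 outcomes
events = [set(rng.choice(12, size=rng.integers(1, 8), replace=False).tolist())
          for _ in range(6)]                      # 6 random events (index sets)

def prob(A):
    """p(A) = sum of p(x) over the outcomes in A."""
    return sum(p[i] for i in A)

def k(A, B):
    """k(A, B) = p(A intersect B) - p(A) p(B)."""
    return prob(A & B) - prob(A) * prob(B)

K = np.array([[k(A, B) for B in events] for A in events])
print(np.linalg.eigvalsh(K).min())                # >= -1e-12: Gram matrix is PSD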
. . . . . . . . .
Kernel trick
Problem 10.10 Kernel SVM bias term
The bias term for the linear SVM can be calculated as
b∗ = (1/|SV|) Σ_{i∈SV} (yi − w^T xi),   (10.16)
where SV = {i | αi > 0} is the set of support vectors.
(a) Derive an expression for b∗ for the kernel SVM.
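For reference, a sketch of the linear bias (10.16) with the standard dual expansion w = Σi αi yi xi substituted in; the toy data and dual variables are made up for illustration, not produced by an SVM solver. The kernelized b∗ asked for in (a) follows the same pattern with the inner products replaced by kernel evaluations.

import numpy as np

def linear_svm_bias(alpha, X, y):
    """Bias from (10.16): average of (y_i - w^T x_i) over the support vectors,
    using the standard dual form w = sum_i alpha_i y_i x_i (assumed)."""
    w = (alpha * y) @ X                      # primal weights from dual variables
    sv = alpha > 1e-8                        # support vectors: alpha_i > 0
    return np.mean(y[sv] - X[sv] @ w)

# toy example with assumed dual variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.3, 0.0])
print(linear_svm_bias(alpha, X, y))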
. . . . . . . . .
Problem 10.11 Kernel logistic regression
Consider the two-class regularized logistic regression from Problem 8.4, which is interpretable as
MAP estimation. The prior distribution on w is zero-mean Gaussian with known precision matrix
Γ (i.e., inverse of the covariance matrix),
p(w) = N (w|0,Γ−1). (10.17)
Given the training set D = {X, y}, the MAP estimate is
w∗ = argmax_w log p(y|X, w) + log p(w),   (10.18)
which can be calculated using the Newton-Raphson method, with the iterations
w(new) = (XRXT + Γ)−1XRz, (10.19)
where R and z are calculated from the previous w(old),
πi = σ(xi^T w(old)),   π = [π1, · · · , πn]^T,   (10.20)
R = diag(π1(1 − π1), · · · , πn(1 − πn)),   (10.21)
z = X^T w(old) − R^{−1}(π − y),   (10.22)
and σ(a) = 1/(1 + e^{−a}) is the logistic sigmoid function. In this problem, we will kernelize the MAP version
of logistic regression to obtain kernel logistic regression.
(a) Define α∗ = w^T x∗ as the linear function of a novel input x∗. Use the form in (10.19) to show
that the linear function w^T x∗ can be kernelized when z is known,
α∗ = k∗^T (K + R^{−1})^{−1} z,   (10.23)
where k∗ = [k(x∗, x1), · · · , k(x∗, xn)]T is the test kernel, and K = [k(xi, xj)]ij is the training
kernel matrix. Hence, the probability of class 1 for a new point x∗ is π∗ = σ(α∗). Hint: use a
matrix inverse identity (Problem 1.15), and note that xi^T Γ^{−1} xj is an inner product.
(b) Using (10.23), show that the z in (10.22) can be calculated iteratively from z(old),
α(old) = K (K + R^{−1})^{−1} z(old),   (10.24)
z = α(old) − R^{−1}(π − y),   (10.25)
where πi = σ(αi(old)). Hence, the Newton-Raphson iterations can also be kernelized for training.
(c) What happened to the prior covariance Γ−1? What is the relationship between the prior and
the kernel function, e.g., when Γ = λI?
(d) Is the scale of the kernel important? In other words, is using k(xi, xj) equivalent to using
k̂(xi, xj) = βk(xi, xj), for some β > 0? Why or why not?
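A compact sketch of the kernelized IRLS iterations (10.24)–(10.25) and the prediction (10.23); the RBF kernel, the clipping of π for numerical safety, the fixed iteration count, and the toy data are assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kernel_logreg_fit(K, y, n_iter=25):
    """Kernelized Newton-Raphson iterations (10.24)-(10.25); y is assumed to be in {0, 1}.
    Returns z and R^{-1}, which are needed for prediction via (10.23)."""
    alpha = np.zeros(len(y))                              # alpha_i = f(x_i), start at 0
    for _ in range(n_iter):
        pi = np.clip(sigmoid(alpha), 1e-9, 1 - 1e-9)
        Rinv = np.diag(1.0 / (pi * (1.0 - pi)))           # R^{-1}
        z = alpha - (pi - y) / (pi * (1.0 - pi))          # (10.25)
        alpha = K @ np.linalg.solve(K + Rinv, z)          # (10.24)
    return z, Rinv

def kernel_logreg_predict(Kstar, K, Rinv, z):
    """pi* = sigmoid(alpha*), with alpha* = k*^T (K + R^{-1})^{-1} z as in (10.23)."""
    return sigmoid(Kstar @ np.linalg.solve(K + Rinv, z))

# toy 1-D example with an assumed RBF kernel
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20)])
y = np.concatenate([np.zeros(20), np.ones(20)])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

z, Rinv = kernel_logreg_fit(K, y)
xs = np.array([-3.0, 0.0, 3.0])
Kstar = np.exp(-0.5 * (xs[:, None] - x[None, :]) ** 2)
print(kernel_logreg_predict(Kstar, K, Rinv, z))           # class-1 probabilities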
. . . . . . . . .
Problem 10.12 Gaussian process regression – nonlinear Bayesian regression
In this problem we will apply the “kernel trick” to Bayesian linear regression (Problem 3.11) to
obtain a non-linear Bayesian regression algorithm, known as Gaussian process regression. The prior
distribution on the linear weights is Gaussian, p(θ) = N (θ|0,Γ). The posterior distribution of the
parameters is Gaussian
p(θ|D) = N (θ|µ̂θ, Σ̂θ), (10.26)
µ̂θ = (Γ^{−1} + ΦΣ^{−1}Φ^T)^{−1} ΦΣ^{−1} y,   (10.27)
Σ̂θ = (Γ^{−1} + ΦΣ^{−1}Φ^T)^{−1},   (10.28)
where µ̂θ is the posterior mean and Σ̂θ is the posterior covariance. Given a novel input x∗, the
predictive distribution of the corresponding output f∗ = f(x∗, θ) is also Gaussian
p(f∗|x∗,D) = N (f∗|µ̂∗, σ̂2∗), (10.29)
µ̂∗ = φ(x∗)^T µ̂θ,   (10.30)
σ̂²∗ = φ(x∗)^T Σ̂θ φ(x∗),   (10.31)
where the posterior mean µ̂θ and covariance Σ̂θ are given in (10.27) and (10.28). We will assume
that the observation noise is i.i.d, i.e., Σ = σ2I.
(a) Show that the predictive mean µ̂∗ can be written in the form
µ̂∗ = φ∗^T ΓΦ (Φ^T ΓΦ + σ²I)^{−1} y,   (10.32)
where φ∗ = φ(x∗). Hint: use a matrix inverse identity from Problem 1.15.
(b) Show that the predictive variance σ̂²∗ can be written in the form
σ̂²∗ = φ∗^T Γφ∗ − φ∗^T ΓΦ (Φ^T ΓΦ + σ²I)^{−1} Φ^T Γφ∗.   (10.33)
Hint: use a matrix inverse identity from Problem 1.15.
(c) The “kernel trick” can be applied to (a) and (b) by defining k(xi, xj) = φ(xi)^T Γφ(xj), resulting in
µ̂∗ = k∗^T (K + σ²I)^{−1} y,   (10.34)
σ̂²∗ = k∗∗ − k∗^T (K + σ²I)^{−1} k∗,   (10.35)
where K = [k(xi, xj)]ij is the kernel matrix, k∗ = [k(x∗, xi)]i is the test kernel vector, and
k∗∗ = k(x∗, x∗). What happened to the prior covariance Γ? What is the relationship between
the prior covariance Γ and the kernel function k(xi, xj)?
(d) Define z = (K + σ²I)^{−1} y. The predictive mean µ̂∗, which is a function of x∗, can then be
written as
µ̂∗ = Σ_{i=1}^n zi k(x∗, xi),   (10.36)
where zi is the ith element of z. Hence, the regressed function is a linear combination of kernel
functions k(x∗, xi), where xi is fixed. What will be the general shape of the regressed function
for the following kernels?
• linear: k(xi, xj) = α xi^T xj.
• polynomial: k(xi, xj) = α (xi^T xj + 1)^q.
• RBF: k(xi, xj) = α1 e^{−α2 ‖xi − xj‖²}.
• periodic: k(xi, xj) = α1 exp{−α2 sin²((xi − xj)/2)}.
Here the α's are the kernel parameters.
(e) σ̂2∗ is the variance of the predictive distribution of the function value f∗, which corresponds
to the uncertainty of the prediction (high variance means high uncertainty). Under what
conditions will the variance be large? When will the variance be small?
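A minimal sketch of the kernelized predictions (10.34)–(10.35); the RBF kernel, its parameters, the noise variance, and the noisy-sine data are assumptions.

import numpy as np

def gp_predict(Xtrain, y, Xtest, kernel, noise_var):
    """GP regression predictive mean (10.34) and variance (10.35)."""
    K = kernel(Xtrain, Xtrain)                            # training kernel matrix
    Ks = kernel(Xtest, Xtrain)                            # test-vs-train kernel
    Kss = kernel(Xtest, Xtest)                            # test-vs-test kernel
    L = K + noise_var * np.eye(len(Xtrain))
    mean = Ks @ np.linalg.solve(L, y)                                     # (10.34)
    var = np.diag(Kss) - np.sum(Ks * np.linalg.solve(L, Ks.T).T, axis=1)  # (10.35), diagonal
    return mean, var

def rbf_kernel(A, B, amp=1.0, gamma=0.5):
    """Assumed RBF kernel on 1-D inputs: amp * exp(-gamma * (x - x')^2)."""
    return amp * np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

rng = np.random.default_rng(0)
Xtrain = rng.uniform(-3, 3, 15)                           # assumed data: noisy sine
y = np.sin(Xtrain) + 0.1 * rng.normal(size=15)
Xtest = np.linspace(-4, 4, 5)

mean, var = gp_predict(Xtrain, y, Xtest, rbf_kernel, noise_var=0.01)
print(mean)
print(var)        # tends to be larger far from the training inputs, smaller near them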
. . . . . . . . .
Problem 10.13 Kernel perceptron
For a training set D = {(x1, y1), . . . , (xn, yn)}, the Perceptron algorithm can be implemented as
Algorithm 1 Perceptron algorithm
  set w = 0, b = 0, R = max_i ‖xi‖
  repeat
    for i = 1, . . . , n do
      if yi(w^T xi + b) ≤ 0 then
        set w ← w + η yi xi
        set b ← b + η yi R²
      end if
    end for
  until there are no classification errors
The final classifier is y∗ = sign(w^T x∗ + b).
(a) Is the learning rate η relevant? Why?
(b) Show that w learned by the perceptron algorithm must take the form w = Σ_{i=1}^n αi yi xi, where αi ≥ 0, ∀i.
(c) Using (b), show that an equivalent Perceptron algorithm (the dual Perceptron) is:
Algorithm 2 Perceptron algorithm (dual)
  set α = 0, b = 0, R = max_i ‖xi‖
  repeat
    for i = 1, . . . , n do
      if yi(Σ_{j=1}^n αj yj xj^T xi + b) ≤ 0 then
        set αi ← αi + 1
        set b ← b + yi R²
      end if
    end for
  until there are no classification errors
(d) Can you give an interpretation to the parameters αi? Which among the samples xi are the
hardest to classify?
(e) Apply the kernel trick to the dual perceptron algorithm to obtain the kernel perceptron algorithm. What is the kernelized decision function?
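A sketch of the kernelized dual perceptron; it follows Algorithm 2 with xj^T xi replaced by k(xj, xi), and takes R² = max_i k(xi, xi) since k(xi, xi) = ‖Φ(xi)‖². The RBF kernel, the epoch cap, and the toy data are assumptions.

import numpy as np

def kernel_perceptron(K, y, max_epochs=100):
    """Dual perceptron with inner products replaced by kernel values. Returns (alpha, b)."""
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    R2 = np.max(np.diag(K))                               # R^2 = max_i k(x_i, x_i)
    for _ in range(max_epochs):
        errors = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1.0
                b += y[i] * R2
                errors += 1
        if errors == 0:
            break
    return alpha, b

def kernel_perceptron_predict(alpha, b, y, Kstar):
    """Decision for new points: sign(sum_j alpha_j y_j k(x_j, x*) + b)."""
    return np.sign(Kstar @ (alpha * y) + b)

# toy 2-D example with an assumed RBF kernel
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.5, (20, 2)), rng.normal(1.5, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
sqd = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
K = np.exp(-sqd(X, X))

alpha, b = kernel_perceptron(K, y)
Xnew = np.array([[-1.5, -1.5], [1.5, 1.5]])
print(kernel_perceptron_predict(alpha, b, y, np.exp(-sqd(Xnew, X))))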
. . . . . . . . .
Problem 10.14 Kernel k-means
In this problem we will kernelize the k-means algorithm.
Consider the original k-means algorithm. Given a set of points X = {x1, · · · , xn}, where
xi ∈ Rd, the goal is to assign a cluster label to each point, yi ∈ {1, · · · ,K}, where K is the number
of clusters. The k-means algorithm calculates the cluster centers µj using the current assignment
to each cluster (K is assumed to be known). In each iteration, k-means performs the following two steps:
Cluster Assignment:   zij = 1 if j = argmin_{k∈{1,··· ,K}} ‖xi − µk‖, and zij = 0 otherwise.   (10.37)
Estimate Center:   µj = (Σ_{i=1}^n zij xi) / (Σ_{i=1}^n zij).   (10.38)
The cluster label for point xi is the label of the closest cluster center, yi = argmaxj zij .
The disadvantage of k-means is that it only works when the clusters can be separated by
a hyperplane. This is a consequence of using the Euclidean distance to measure the distance to
the cluster center. To apply the kernel trick, we need to rewrite k-means so that it only uses
inner-products between the data points xi.
(a) Show that the squared distance from x to the cluster center µk can be written only using
inner-products as
d(x, µk) = ‖x − µk‖² = x^T x − (2/Nk) Σ_{l=1}^n zlk x^T xl + (1/Nk²) Σ_{l=1}^n Σ_{m=1}^n zlk zmk xl^T xm,   (10.39)
where Nk = Σ_{m=1}^n zmk is the number of points in cluster k.
(b) Apply the kernel trick to (10.39) to obtain the kernelized squared distance,
d(x, µk) = ‖x − µk‖² = k(x, x) − (2/Nk) Σ_{l=1}^n zlk k(x, xl) + (1/Nk²) Σ_{l=1}^n Σ_{m=1}^n zlk zmk k(xl, xm).   (10.40)
(c) What is the interpretation of the distance in (10.40) when k(x, x′) is the Gaussian kernel?
When will the distance be small?
(d) Show that the kernel k-means algorithm is:
Calculate Distances:   d(xi, µk) = k(xi, xi) − (2/Nk) Σ_{l=1}^n zlk k(xi, xl) + (1/Nk²) Σ_{l=1}^n Σ_{m=1}^n zlk zmk k(xl, xm),  ∀i   (10.41)
Cluster Assignment:   zij = 1 if j = argmin_{k∈{1,··· ,K}} d(xi, µk), and zij = 0 otherwise.   (10.42)
A weighted version of kernel k-means has an interesting connection to many spectral clustering and graph clustering methods (see I.S. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” Proc. ACM SIGKDD, 2004).
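A compact sketch of kernel k-means using the kernelized distance (10.40)/(10.41); the RBF kernel, the random initialization, the iteration cap, and the two-rings data are assumptions.

import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Iterate the kernelized distance computation and cluster assignment."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=n)             # random initial assignment
    for _ in range(n_iter):
        D = np.zeros((n, n_clusters))
        for k in range(n_clusters):
            idx = np.where(labels == k)[0]
            if len(idx) == 0:                              # empty cluster: never chosen
                D[:, k] = np.inf
                continue
            Nk = len(idx)
            # d(x_i, mu_k) = k(x_i,x_i) - (2/Nk) sum_l k(x_i,x_l) + (1/Nk^2) sum_{l,m} k(x_l,x_m)
            D[:, k] = (np.diag(K)
                       - 2.0 / Nk * K[:, idx].sum(axis=1)
                       + K[np.ix_(idx, idx)].sum() / Nk ** 2)
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# toy data that a hyperplane cannot separate: two concentric rings
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 100)
r = np.concatenate([np.full(50, 1.0), np.full(50, 3.0)]) + 0.1 * rng.normal(size=100)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
K = np.exp(-2.0 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(kernel_kmeans(K, n_clusters=2))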
. . . . . . . . .
Problem 10.15 Kernel discriminant analysis
In this problem we will derive a kernel version of Fisher’s linear discriminant (FLD, see Problem 7.6).
Let X1 = [x1, · · · , xn1] and X2 = [x1, · · · , xn2] be the matrices of feature vectors from class 1 and
class 2, where n1 and n2 are the number of feature vectors for class 1 and class 2, respectively. The
class mean and scatter matrix for class j are given by
µj = (1/nj) Σ_{xi∈Xj} xi,   Sj = Σ_{xi∈Xj} (xi − µj)(xi − µj)^T.   (10.43)
FLD finds the optimal projection that maximizes the ratio of the “between-class” scatter and the
“within-class” scatter,
w∗ = argmax_w J(w),   J(w) = (w^T SB w) / (w^T SW w),   (10.44)
where SB and SW are the between- and within-class scatter matrices,
SB = (µ1 − µ2)(µ1 − µ2)T , SW = S1 + S2. (10.45)
The optimal projection is w∗ = S−1W (µ1 − µ2).
Let X = [X1, X2] be all the data. To do the kernelization, we will first assume that the
optimal w can be written as a linear combination of the data points,
w = Σ_{i=1}^n αi xi = Xα,   (10.46)
where α = [α1, · · · , αn]^T are the weights.
(a) Why is the assumption in (10.46) valid?
(b) Show that the class mean can be written as
µj = (1/nj) Xj 1,   (10.47)
and the scatter matrix as
Sj = Xj (I − (1/nj) 1 1^T) Xj^T,   (10.48)
where 1 is a vector of ones.
(c) Using (b), show that the mean and scatter matrices, when multiplied by w, can be written in terms of α as
w^T µj = α^T µ̂j,   w^T Sj w = α^T Ŝj α,   (10.49)
µ̂j = (1/nj) X^T Xj 1,   Ŝj = X^T Xj (I − (1/nj) 1 1^T) Xj^T X.   (10.50)
(d) Apply the kernel trick to µ̂j and Ŝj to obtain the expressions
µ̂j = (1/nj) Kj 1,   Ŝj = Kj (I − (1/nj) 1 1^T) Kj^T,   (10.51)
where Kj is the kernel matrix between X and Xj, i.e.,
[Kj]_{i,i′} = k(xi, xi′),   xi ∈ X,  xi′ ∈ Xj.   (10.52)
(e) Show that the kernelized version of (10.44) is
α∗ = argmax_α J(α),   J(α) = (α^T ŜB α) / (α^T ŜW α),   (10.53)
ŜB = (µ̂2 − µ̂1)(µ̂2 − µ̂1)^T,   ŜW = Ŝ1 + Ŝ2.   (10.54)
(f) Show that the optimal α∗ is given by
α∗ = Ŝ−1W (µ̂2 − µ̂1). (10.55)
(Hint: similar to the original problem, only the direction of α matters, not the magnitude.)
(g) Finally, show that the kernelized projection of x is
w^T x = Σ_{i=1}^n αi k(x, xi).   (10.56)
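A minimal sketch of kernel FLD following (10.51)–(10.56); the RBF kernel, the small ridge added to ŜW for invertibility, and the toy data are assumptions.

import numpy as np

def kernel_fld(K1, K2, reg=1e-6):
    """alpha* = S_W_hat^{-1} (mu2_hat - mu1_hat), with (10.51) and a small ridge (assumption).
    K1, K2 are the n x n_j kernel matrices between all the data X and each class Xj."""
    def mu_hat(Kj):
        return Kj.mean(axis=1)                            # (1/n_j) K_j 1
    def S_hat(Kj):
        nj = Kj.shape[1]
        C = np.eye(nj) - np.ones((nj, nj)) / nj           # centering: I - (1/n_j) 1 1^T
        return Kj @ C @ Kj.T
    Sw = S_hat(K1) + S_hat(K2) + reg * np.eye(K1.shape[0])
    return np.linalg.solve(Sw, mu_hat(K2) - mu_hat(K1))

def project(alpha, Kx):
    """Kernelized projection (10.56): rows of Kx are [k(x, x_1), ..., k(x, x_n)]."""
    return Kx @ alpha

# toy 2-D example with an assumed RBF kernel
rng = np.random.default_rng(0)
X1 = rng.normal(-1.0, 1.0, (15, 2))
X2 = rng.normal(+1.0, 1.0, (15, 2))
X = np.vstack([X1, X2])
rbf = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

alpha = kernel_fld(rbf(X, X1), rbf(X, X2))
Xnew = np.array([[-1.0, -1.0], [1.0, 1.0]])
print(project(alpha, rbf(Xnew, X)))                       # projections of two new points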
. . . . . . . . .