代写 R algorithm graph Bayesian Problem 10.1

Problem 10.1
CS5487 Problem Set 10
Kernels
Antoni Chan Department of Computer Science City University of Hong Kong
Kernel functions
Constructing kernels from kernels
Suppose k1x,z and k2x,z are valid kernels. Prove the following kernels are also valid kernels:
a kernel scaling: b sum:
c product:
d input scaling: e polynomial: f exponential:
kx, z ck1x, z,
kx, z k1x, z k2x, z. kx, z k1x, zk2x, z. kx,z fxk1x,zfz, kx, z k1x, zq,
kx, z expk1x, z.
………
where c 0.
for any function f.
where q is a positive integer.
Problem 10.2 Kernels on real vectors
Let x, z 2 Rn. Show the following are valid kernels,
a Gaussian or RBF: b exponential:
c linear:
d polynomial:
e cosine:
kx,z exp kx zk2, kx,z exp kx zk, kx,z xT Az,
kx,z xT z 1q , kx,z xTz .
for 0.
for 0.
where A is a positive definite matrix. where q is a positive integer.
kxkkzk 2 xz
kx,z expsin 2 x,z 2 R.
f periodic:
Forf,considerthewarpinguxsinx,cosxT andthefactthatkuxuzk2 4sin2xz.
What are the feature transformations corresponding to each of these kernels?
………
Problem 10.3 Constructing kernels from kernels part 2 feature selection
Suppose x,z 2 Rd, x is a function from x to RM, and k3, is a valid kernel in RM. Prove the following is a valid kernel:
a kx, z k3x, z.
Let xa and xb be variables with x xa, xb, and ka and kb are valid kernels over their respective
spaces. Show that the following are valid kernels:
b kx, z kaxa, za kbxb, zb.
c kx, z kaxa, zakbxb, zb.
………
73
2

Problem 10.4 Constructing kernels from kernels part 3 normalization
Some kernels are poorly scaled, in the sense that the dynamic range of the values is much less than the absolute values. One solution to this problem is to normalize the kernel,
k x,zp kx,z . 10.1 kx, xkz, z
This will set the selfsimilarity value to kx, x 1.
a Show that the normalized kernel k is a valid kernel.
b Let x be the feature transformation associated with the kernel k. Show that k is equivalent to calculating the cosine between x and z in the highdimensional feature space.
c What is the range of possible values of the normalized kernel k x, z? ………
Problem 10.5 Constructing kernels from kernels part 4 kernel distance
If kx, z is a valid kernel, then it can be interpreted as a dot product. Hence, we can also construct a distance norm based on this dot product.
kxzk2 xTx2xTzzTz d2x,zkx,x2kx,zkz,z. 10.2 This squareddistance can be used in place the standard L2 norm.
a Show that the exponential kernel distance is a valid kernel:
kx, z expkx, x 2kx, z kz, z. 10.3 This is similar to a Gaussian kernel, but with the L2norm replaced with the kernel distance.
……… Problem 10.6 Kernels for histograms
Let x and z be ddimensional histograms, i.e., x, z 2 Rd, where each bin xi, zi 0. Show that the following kernels between histograms are valid kernels:
a correlation:
b Bhattacharyya:
c histogram intersection: d 2RBF:
kx, z Pdi1 xizi.
kx, z Pdi1 pxipzi. kx, z di1 minxi, zi.
………
74
kx, z exp 2x, z , 2x, z Pd xizi2 .
i1 1xizi 2

Problem 10.7 Kernels on sets
Let X x1, ,xM and Z z1, ,zP be set, where x,z are the set items these could be real vectors or symbols. Define a kernel between items as kx, z.
a Show that the kernel between sets X and Z,
k X, Z X X kx, z,
x2X z2Z
is a valid kernel. Hint: consider the feature transformation X
the feature map induced by kx, z.
b Show that the kernel
k X, Z 1 X X kx, zq, XZ x2X z2Z
P 10.4 Mi1 xi, where xi is
10.5
is a valid kernel, where q is a nonnegative integer. This is called an exponent match kernel. c Show that the kernel, which counts the number of common elements, is a valid kernel,
k X, Z X Z. 10.6 d Show that the following kernel is also valid,
k X, Z 2XZ. 10.7 Another kernel on sets is the pyramid match kernel. There are also other kernels that are not
10.8
10.9
10.10
Consider that the two inputs are distributions, px and qx. The correlation is a valid kernel between distributions Z
positive definite, which have the form using the sumofminimum distances,
kX, Z edX,Z2
dX,Z 1 Xmindxi,zjXmindzi,xj
XZ i j i j or the maxofminimum distances Haussdorf distance,
dX,Z maxmaxmindxi,zj,maxmindzi,xj. ij ij
……… Problem 10.8 Kernels on distributions
kp, q pxqxdx. 10.11
kp,qZ pxqxdx 10.12 is a valid kernel, for any 0. This is called the probability product kernel. A special case is
the Bhattacharyya kernel 0.5, and the correlation kernel 1. ………
75
a Show that the kernel

Problem 10.9 Kernels on sample spaces
Let X be a sample space of random variable x,Zand A and B events, where pA pxdx
10.13
define the kernel
x2A
k : X X ! 1, 1
kA, B pA B pApB
a Show that kA,B is a valid kernel. Hint: consider the feature transformation A 1A
pA, where 1A is the function that takes the value 1 on the set A, and 0 otherwise. ………
Kernel trick
Problem 10.10 Kernel SVM bias term
10.14 10.15
The bias term for the linear SVM can be calculated as
b 1 Xyi wTxi
10.16
SV i2SV where SV ii 0 is the set of support vectors.
a Derive an expression for b for the kernel SVM. ………
Problem 10.11 Kernel logistic regression
Consider the twoclass regularized logistic regression from Problem 8.4, which is interpretable as MAP estimation. The prior distribution on w is zeromean Gaussian with known precision matrix i.e., inverse of the covariance matrix,
pw N w0, 1. Given the training set D X, y, the MAP estimate is
w argmax log pyX, w log pw, w
which can be calculated using the NewtonRaphson method, with the iterations wnew XRXT 1XRz,
where R and z are calculated from the previous wold,
i xTi wold, 1, ,nT, Rdiag111,,n1n,
1e
of logistic regression to obtain kernel logistic regression.
76
10.17 10.18
10.19
z XT wold R1 y.
a 1a is the logistic sigmoid function. In this problem, we will kernelize the MAP version
10.20 10.21 10.22

a Define wT x as the linear function of a novel input x. Use the form in 10.19 to show that the linear function wT x can be kernelized when z is known,
wT x kT K R11z 10.23
where k kx, x1, , kx, xnT is the test kernel, and K kxi, xj ij is the training kernel matrix. Hence, the probability of class 1 for a new point x is . Hint: use a matrix inverse identity Problem 1.15, and note that xTi 1xj is an inner product.
b Using 10.23, show that the z in 10.22 can be calculated iteratively from zold,
old KK R11zold 10.24
z old R1 y, 10.25 where i old. Hence, the NewtonRaphson iterations can also be kernelized for training.
c What happened to the prior covariance 1? What is the relationship between the prior and the kernel function, e.g., when I?
d Is the scale of the kernel important? In other words, is using kxi,xj equivalent to using kxi,xj kxi,xj, for some 0? Why or why not?
………
Problem 10.12 Gaussian process regression nonlinear Bayesian regression
In this problem we will apply the kernel trick to Bayesian linear regression Problem 3.10 to obtain a nonlinear Bayesian regression algorithm, known as Gaussian process regression. The prior distribution on the linear weights is Gaussian, p N 0, . The posterior distribution of the parameters is Gaussian
pD N,,
1 1T 11y,
1 1T 1,
where is the posterior mean and is the posterior covariance. Given a novel input x, the
predictive distribution of the corresponding output f fx, is also Gaussian
pfx, D N f, 2, xT ,
2 xT x,
where the posterior mean and covariance are given in 10.27 and 10.28. We will assume
that the observation noise is i.i.d, i.e., 2I.
a Show that the predictive mean can be written in the form,
T T 2I1y, 10.32 where x. Hint: use a matrix inverse identity from Problem 1.15.
77
i
10.26 10.27 10.28
10.29 10.30 10.31

b Show that the predictive variance 2 can be written in the form,
2 T T T 2I1. 10.33
Hint: use a matrix inverse identity from Problem 1.15.
c The kernel trick can be applied to a and b, by defining setting kxi,xj xiTxj
yielding,
kT K 2I1y 10.34 2 k kT K 2I1k, 10.35
where K kxi,xjij is the kernel matrix, k kx,xii is the test kernel vector, and k kx, x. What happened to the prior covariance ? What is the relationship between the prior covariance and the kernel function kxi,xj?
d Define z K 2I1y. The predictive mean , which is a function of x, can then be written as
Xn i1
where zi is the ith element of z. Hence, the regressed function is a linear combination of kernel functions kx,xi, where xi is fixed. What will be the general shape of the regressed function for the following kernels?
linear: kxi,xj xTi xj.
polynomial: kxi,xjxTi xj 12.
RBF: kxi,xj 1e2kxixjk2.
periodic: kxi,xj 1 exp2 sin2xixj . 2
are the kernel parameters.
e 2 is the variance of the predictive distribution of the function value f, which corresponds to the uncertainty of the prediction high variance means high uncertainty. Under what conditions will the variance be large? When will the variance be small?
………
x
zikx, xi, 10.36
78

Problem 10.13 Kernel perceptron
For a training set D x1,y1,…,xn,yn, the Perceptron algorithm can be implemented as follows.
Algorithm 1 Perceptron algorithm set w 0,b 0,R maxi xi repeat
for i 1,…,n do
if yiwT xi b 0 then
set w w yixi
set b b yiR2 end if
end for
until there are no classification errors
The final classifier is y signwT x b.
a Is the learning rate relevant? Why?
b Show that w learned by the perceptron algorithm must take the form w Pni1 iyixi, where i 0, 8i.
c Using b, show that an equivalent Perceptron algorithm the dual Perceptron is:
Algorithm 2 Perceptron algorithm dual set 0,b 0,R maxi xi
repeat
for i P1,…,n do
ifyi nj1jyjxTjxib0then
set i i 1
set b b yiR2 end if
end for
until there are no classification errors
d Can you give an interpretation to the parameters i? Which among the samples xi are the hardest to classify?
e Apply the kernel trick the the dual perceptron algorithm to obtain the kernel perceptron algo rithm. What is the kernelized decision function?
………
79

Problem 10.14 Kernel kmeans
In this problem we will kernelize the kmeans algorithm.
Consider the original kmeans algorithm. Given a set of points X x1, , xn, where
xi 2 Rd, the goal is to assign a cluster label to each point, yi 2 1, ,K, where K is the number of clusters. The kmeans algorithm calculates the cluster centers j using the current assignment to each cluster K is assumed to be known. In each iteration, Kmeans performs the following two steps:
Cluster Assignment : zij
Estimate Center : j Pn z .
1, j argmink21, ,K kxi kk2 P0, otherwise.
10.37
10.38
ni 1 z i j x i i1 ij
The cluster label for point xi is the label of the closest cluster center, yi argmaxj zij.
The disadvantage of kmeans is that it only works when the clusters can be separated by a hyperplane. This is a consequence of using the Euclidean distance to measure the distance to the cluster center. To apply the kernel trick, we need to rewrite kmeans so that it only uses
innerproducts between the data points xi.
a Show that the squared distance from x to the cluster center k can be written only using
innerproducts as
dx, k kx k k2 xT x 2 N
1 Xn Xn
zikkx,xlN2 zlkzmkkxl,xm. 10.40
1 Xn
1 Xn Xn
zlk xT xl N 2 zlk zmk xTl xm 10.39
k l1
b Apply the kernel trick to 10.39 to obtain the kernelized square distance,
k l1 m1 where Nk Pnl1 zmk is the number of points in cluster k.
2 1 Xn
dx,kkxkk kx,x2N
k l1
k l1 m1
c What is the interpretation of the distance in 10.40 when kx,x0 is the Gaussian kernel?
When will the distance be small?
d Show that the kernel kmeans algorithm is:
CalculateDistances: dxi,kkxi,xi2N
k l1
Cluster Assignment : zij 1, j argmink21, ,K dxi, k 0, otherwise.
A weighted version of Kernel kmeans has interesting connection to many spectral clustering and graph clustering methods see I.S. Dhillon, Y. Guan, B. Kulis, Kernel kmeans: spectral clustering and normalized cuts. Proc. ACM SIGKDD, 2004.
………
80
1 Xn
1 Xn Xn zikkxi,xlN2
zlkzmkkxl,xm,8i 10.41
10.42
k l1 m1

Problem 10.15 Kernel discriminant analysis
In this problem we will derive a kernel version of Fishers linear discriminant FLD, see Problem 7.6 Let X1 x1, ,xn1 and X2 x1, ,xn2 be the matrix of feature vectors from class 1 and class 2, and n1 and n2 are the number of feature vectors for class 1 and class 2, respectively. The class mean and scatter matrix are given by
nj
Sj Xxi jxi jT. 10.43
withinclass scatter,
nj
j 1 Xxi,
nj i1
FLD finds the optimal projection that maximizes the ratio of the betweenclass scatter and the
w argmaxJw, Jw wTSBw, w wTSWw
10.44
i1
where SB and SW are the between and withinclass scatter matrices, SB 1 21 2T, SW S1 S2.
10.45 Let X X1,X2 be all the data. To do the kernelization, we will first assume that the
The optimal projection is w S11 2. W
optimal w can be written as a linear combination of the data points,
Xn i1
a Why is the assumption in 10.46 valid?
b Show that the class mean can be written as
w where 1, ,n are the weights.
ixi X,
10.46
10.47
10.48
and the scatter matrix as
j 1 Xj1, nj
Sj XjI 1 11TXjT, nj
where 1 is a vector of ones.
c Using b, show the mean and scatter matrices when multiplied by w can be written in terms
of ,
where
wTj Tj, wTSjwTSj j1XTXj1, SjXTXjI111TXjTX
10.49
10.50
nj nj 81

d Apply the kernel trick to j and Sj to obtain expressions, j1Kj1, SjKjI111TKj,
10.51
10.52
10.53
10.54
nj nj where Kj is the kernel matrix between X and Xj, i.e.,
xi 2 X, xi 2 Xj. J TSB,
f Show that the optimal is given by
Kji,i0 kxi,xi0, e Show that the kernelized version of 10.44 is
argmaxJ,
where
T S W
SB 2 12 1T , SW S1 S2.
g Finally, show that the kernelized projection of x is
Xn i1
10.56
S 1 2 1 . W
10.55 Hint: similar to the original problem, only the direction of matters, and not the magnitude.
z ………
82
ikx, xi.