
The Kernel Trick
Consider the assumption of linear regression with an explicit feature expansion φ: x ↦ φ(x):
\[
  y(x) = \phi(x)^\top w + \varepsilon(x).
\]
Given training data D = {(xᵢ, yᵢ)} = (X, y), recall we modeled the residuals ε(x) as zero-mean, independent, identically distributed Gaussians with variance σ²:
\[
  p(\varepsilon) = \mathcal{N}(\varepsilon; 0, \sigma^2 I),
\]
giving rise to the following likelihood:
\[
  p(y \mid X, w, \sigma^2) = \mathcal{N}(y; \Phi w, \sigma^2 I),
\]
where Φ = φ(X) is the matrix whose rows are the feature vectors of the training inputs.
In Bayesian linear regression, we further chose a multivariate Gaussian prior for w:
\[
  p(w) = \mathcal{N}(w; \mu, \Sigma).
\]
For simplicity, in what follows we assume the prior mean for w is μ = 0.
Given these assumptions, we were able to derive the posterior distribution of w given the data D:
\[
  p(w \mid \mathcal{D}, \sigma^2) = \mathcal{N}(w; \mu_{w \mid \mathcal{D}}, \Sigma_{w \mid \mathcal{D}}),
\]
where
\[
  \mu_{w \mid \mathcal{D}} = \Sigma \Phi^\top (\Phi \Sigma \Phi^\top + \sigma^2 I)^{-1} y;
\]
\[
  \Sigma_{w \mid \mathcal{D}} = \Sigma - \Sigma \Phi^\top (\Phi \Sigma \Phi^\top + \sigma^2 I)^{-1} \Phi \Sigma.
\]
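To make these formulas concrete, here is a minimal NumPy sketch of the posterior computation; the function posterior_weights and its argument names are illustrative (not notation from the text), and a linear solve replaces the explicit matrix inverse.

import numpy as np

def posterior_weights(Phi, y, Sigma, sigma2):
    """Posterior over the weights w for features Phi = phi(X), targets y,
    prior covariance Sigma, noise variance sigma2, and prior mean zero."""
    # A = Phi Sigma Phi^T + sigma^2 I is the prior covariance of the observations y
    A = Phi @ Sigma @ Phi.T + sigma2 * np.eye(len(y))
    SPt = Sigma @ Phi.T
    # mu_{w|D} = Sigma Phi^T A^{-1} y, computed with a linear solve
    mu_post = SPt @ np.linalg.solve(A, y)
    # Sigma_{w|D} = Sigma - Sigma Phi^T A^{-1} Phi Sigma
    Sigma_post = Sigma - SPt @ np.linalg.solve(A, Phi @ Sigma)
    return mu_post, Sigma_post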
If we wish to use our model to predict the outputs y∗ associated with a set of inputs X∗, we previously derived:
\[
  p(y_* \mid X_*, \mathcal{D}, \sigma^2) = \mathcal{N}(y_*;\; \Phi_* \mu_{w \mid \mathcal{D}},\; \Phi_* \Sigma_{w \mid \mathcal{D}} \Phi_*^\top + \sigma^2 I),
\]
where Φ∗ = φ(X∗).
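Continuing the sketch above, the predictive distribution is a short extension; again, predict and its arguments are illustrative names.

import numpy as np

def predict(Phi_star, mu_post, Sigma_post, sigma2):
    """Predictive mean and covariance of y_star at test features Phi_star = phi(X_star),
    given the posterior (mu_post, Sigma_post) over w."""
    mean = Phi_star @ mu_post
    # Phi_* Sigma_{w|D} Phi_*^T + sigma^2 I
    cov = Phi_star @ Sigma_post @ Phi_star.T + sigma2 * np.eye(len(Phi_star))
    return mean, cov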
Examining these expressions, we see that the feature expansion φ only ever appears inside one of the following matrices:
\[
  \Phi \Sigma \Phi^\top, \qquad \Phi_* \Sigma \Phi^\top, \qquad \Phi \Sigma \Phi_*^\top, \qquad \Phi_* \Sigma \Phi_*^\top.
\]
The entries of these matrices are always of the form φ(x)⊤Σφ(x′), where x and x′ are two arbitrary inputs. To simplify our expressions, we define a function
\[
  K(x, x') = \phi(x)^\top \Sigma\, \phi(x').
\]
Because Σ is positive definite, it has a “matrix square root” Σ^{1/2} with the property (Σ^{1/2})² = Σ.¹
If we define the function ψ(x) = Σ^{1/2}φ(x), we can see that K is simply an inner product:
\[
  K(x, x') = \psi(x)^\top \psi(x').
\]
¹ You can prove this via the singular value decomposition (SVD): write Σ = UDU⊤, where U is unitary and D is diagonal with positive entries (because Σ is positive definite), then define Σ^{1/2} = UD^{1/2}U⊤.
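The following NumPy snippet illustrates both the footnote's construction of Σ^{1/2} and the resulting inner-product view of K; the random Σ and feature vectors are made up purely for this check.

import numpy as np

# Build a random symmetric positive-definite Sigma and take its square root
# via the eigendecomposition Sigma = U diag(d) U^T (which coincides with the SVD here).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)

d, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(d)) @ U.T
assert np.allclose(Sigma_half @ Sigma_half, Sigma)

# With psi(x) = Sigma^{1/2} phi(x), K(x, x') = phi(x)^T Sigma phi(x') is an ordinary inner product.
phi_x, phi_xp = rng.standard_normal(4), rng.standard_normal(4)
assert np.allclose(phi_x @ Sigma @ phi_xp, (Sigma_half @ phi_x) @ (Sigma_half @ phi_xp))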

Such a function K is guaranteed to produce symmetric positive-semidefinite Gram matrices (a Gram matrix is a square matrix of inner products between pairs of elements), and is called a kernel or covariance function.
Sometimes it is possible to specify a covariance function K directly without ever computing the feature map explicitly. With such a function, we can perform Bayesian linear regression efficiently even when the implicit feature expansion φ is high-dimensional (or even infinite-dimensional!). This idea of computing inner products in a feature space directly is called the kernel trick, and it has been the basis of a large amount of work in the machine-learning community. Effectively, any algorithm that operates purely in terms of inner products between input vectors can be made nonlinear by replacing the ordinary inner products with evaluations of a kernel.
With this definition, we may rewrite the predictive distribution for y∗:
\[
  p(y_* \mid X_*, \mathcal{D}, \sigma^2) = \mathcal{N}(y_*; \mu_{y_* \mid \mathcal{D}}, K_{y_* \mid \mathcal{D}}),
\]
where
\[
  \mu_{y_* \mid \mathcal{D}} = K(X_*, X) \bigl( K(X, X) + \sigma^2 I \bigr)^{-1} y;
\]
\[
  K_{y_* \mid \mathcal{D}} = K(X_*, X_*) - K(X_*, X) \bigl( K(X, X) + \sigma^2 I \bigr)^{-1} K(X, X_*) + \sigma^2 I.
\]
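As with the weight-space formulas, these kernelized expressions translate directly into code. Below is a minimal sketch; kernel_predict is an illustrative name, and K is assumed to be any function returning the matrix of kernel values between two sets of inputs.

import numpy as np

def kernel_predict(K, X, y, X_star, sigma2):
    """Kernelized Bayesian linear regression: predictive mean and covariance of y_star.
    K(A, B) must return the matrix of kernel values between the rows of A and B."""
    K_xx = K(X, X) + sigma2 * np.eye(len(y))
    K_sx = K(X_star, X)
    # mu_{y*|D} = K(X*, X) (K(X, X) + sigma^2 I)^{-1} y
    mu_star = K_sx @ np.linalg.solve(K_xx, y)
    # K_{y*|D} = K(X*, X*) - K(X*, X) (K(X, X) + sigma^2 I)^{-1} K(X, X*) + sigma^2 I
    cov_star = (K(X_star, X_star)
                - K_sx @ np.linalg.solve(K_xx, K_sx.T)
                + sigma2 * np.eye(len(X_star)))
    return mu_star, cov_star

Note that the feature expansion φ never appears: only kernel evaluations are required.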
Examples

Perhaps the most commonly used kernel is the squared exponential covariance function:
\[
  K(x, x'; \lambda, \ell) = \lambda^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right),
\]
where λ and l are parameters. The former is simply a multiplicative scaling constant (you can think of this as an implicit scalar multiplication in the implicit feature map φ). The latter takes the role of a length scale; vectors separated by more than a couple length scales will have a kernel value near zero.
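As an illustration, a vectorized NumPy implementation of this kernel might look as follows; the function squared_exponential and its default parameter values are illustrative choices.

import numpy as np

def squared_exponential(A, B, lam=1.0, ell=1.0):
    """Squared exponential kernel lam^2 exp(-||x - x'||^2 / (2 ell^2)),
    evaluated between every row of A and every row of B (both 2-D arrays)."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    sq_dists = np.maximum(sq_dists, 0.0)   # guard against tiny negative values from round-off
    return lam**2 * np.exp(-sq_dists / (2.0 * ell**2))

Passing this function as the argument K of the kernel_predict sketch above, with λ = l = 1 and σ² = 0.1², would produce the kind of nonlinear fit shown in Figure 1.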
An example of Bayesian linear regression using this kernel function is shown in Figure 1. We see that the use of this kernel function allowed us to achieve nice nonlinear regression without computing explicit basis expansions. In fact, you can show that the squared exponential kernel corresponds to an infinite-dimensional basis expansion, where we use a Gaussian basis function centered on every point. Such a feature expansion would be impossible to work with if we insisted on explicit feature computation.
Sometimes thinking in terms of the kernel can help even when you have an explicit feature expansion on hand. As an example, imagine our inputs are binary vectors of length n, so each input x encodes a subset of {1, …, n}, a member of its power set. One rather expensive feature expansion we could try would be to enumerate every member of the power set and define φ(x)ᵢ = [sᵢ ⊆ x] (the indicator that sᵢ is contained in x), where sᵢ is the ith element of the power set. That is, we represent the set x by a feature vector of length 2^n indicating which subsets are contained in x. This is a very expensive feature expansion, requiring exponential space to store for each input. However, if we take Σ = I, we can compute the dot product as:
\[
  K(x, x') = 2^{\lvert x \cap x' \rvert}, \tag{1}
\]
which only requires time and space linear in n!
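A small sketch of this computation (the function subset_kernel and the example sets are illustrative):

import numpy as np

def subset_kernel(x, xp):
    """Kernel from equation (1): x and xp are length-n 0/1 vectors encoding subsets.
    The implicit 2^n-dimensional feature map indicates every subset, yet the kernel
    value 2^{|x ∩ x'|} costs only O(n) time and space."""
    overlap = int(np.sum(np.logical_and(x, xp)))   # |x ∩ x'|
    return float(2 ** overlap)

# Example with n = 4: x encodes {1, 3}, xp encodes {1, 2, 3}
x = np.array([1, 0, 1, 0])
xp = np.array([1, 1, 1, 0])
print(subset_kernel(x, xp))   # |x ∩ x'| = 2, so the kernel value is 2^2 = 4.0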

[Plot omitted.]

Figure 1: Example of Bayesian linear regression using the squared exponential covariance function. The true function is f(x) = sin(x); shown are the observations D, the true function, the predictive mean μ_{y∗|D}, and ±2 √diag(K_{y∗|D}) error bars over x ∈ [−4, 4]. The kernel parameters are λ = l = 1, and the noise variance was set to σ² = 0.1².