Conditioning on Outputs of Linear Operators
Suppose we have a function f : X → R with a Gaussian process prior distribution: p(f) = GP(f; μ, K).
We have discussed how to perform inference about f when given (noisy) observations of the function at a set of points X: D = (X, y). Here we are going to expand the types of observations we may use during GP inference.
Functionals and linear functionals
Specifically, we are going to consider so-called linear functionals of f. A functional is a function L[f] that takes as an input a function f and returns a scalar. (Functionals are sometimes called “functions of functions.”) A very simple example of a functional is the point-evaluation functional. Let x ∈ X be an arbitrary fixed point in the domain. We define a corresponding functional Lx by
f → Lx[f] = f(x).
So, given a function f , the point-evaluation functional Lx simply evaluates f at x and returns the
result. This is a functional we are very accustomed to using.
A functional is said to be linear when it satisfies a simple linearity property. Specifically, let a ∈ R be an arbitrary scalar constant and let f and g be two arbitrary functions. A functional L is linear if the following equality always holds:
L[af + g] = aL[f] + L[g].
It is easy to see that the point-evaluation functional Lx is linear:
Lx[af + g] = (af + g)(x) = af(x) + g(x) = aLx[f] + Lx[g].
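The linearity property can also be checked numerically. The following is a minimal sketch, not part of the original notes: functionals are represented as Python callables that take a function and return a number. The helper names (`point_evaluation`, `integration_against`, `derivative_at`, `linearity_gap`) and the test functions are hypothetical, and the integration and differentiation functionals (both discussed next) are only approximated here by a trapezoid rule and a central difference.

```python
import numpy as np

def point_evaluation(x):
    """Return the functional L_x that maps f to f(x)."""
    return lambda f: f(x)

def integration_against(p, a, b, n=2001):
    """Return a trapezoid-rule approximation of f -> integral of f(x) p(x) over [a, b]."""
    xs = np.linspace(a, b, n)
    w = np.full(n, (b - a) / (n - 1))
    w[[0, -1]] *= 0.5  # trapezoid end-point weights
    return lambda f: float(np.sum(w * f(xs) * p(xs)))

def derivative_at(x, h=1e-4):
    """Return a central-difference approximation of f -> f'(x)."""
    return lambda f: (f(x + h) - f(x - h)) / (2.0 * h)

def linearity_gap(L, f, g, a):
    """Return |L[a f + g] - (a L[f] + L[g])|, which is zero for a linear L."""
    combined = lambda x: a * f(x) + g(x)
    return abs(L(combined) - (a * L(f) + L(g)))

# Arbitrary test functions and scalar (assumptions for illustration only).
f = np.sin
g = lambda x: x ** 2
a = 3.0
uniform_weight = lambda x: np.ones_like(x)

functionals = {
    "point evaluation at x = 1": point_evaluation(1.0),
    "integration on [0, 2]": integration_against(uniform_weight, 0.0, 2.0),
    "derivative at x = 1": derivative_at(1.0),
}
gaps = {name: linearity_gap(L, f, g, a) for name, L in functionals.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.2e}")

# A nonlinear functional for contrast: f -> f(1)^2 fails the same check.
squared_evaluation = lambda f: f(1.0) ** 2
nonlinear_gap = linearity_gap(squared_evaluation, f, g, a)
print(f"squared evaluation: gap = {nonlinear_gap:.2e}")
```

The three linear functionals give gaps at the level of floating-point rounding, while the squared evaluation does not.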
There are several other quite-common linear functionals that we are familiar with. The two we will discuss here are integration against an arbitrary function p(x):
$$f \mapsto I_p[f] = \int_{\mathcal{X}} f(x)\, p(x)\, \mathrm{d}x,$$
and (partial) differentiation at a point x:
$$f \mapsto D_{x,i}[f] = \left.\frac{\partial f(z)}{\partial z_i}\right|_{z = x}.$$
Conditioning on linear functionals
It turns out that we can once again exploit the closure of the Gaussian distribution to linear transformations to condition a GP on f on the observation of any linear functional of f! This will allow us to both perform inference about f given observations of, for example, derivatives of f, and also to perform inference about linear functionals of f directly. This will provide us with a Bayesian mechanism for estimating integrals (a task traditionally called quadrature).
Suppose we have an unknown function f : X → R with the Gaussian process prior above: p(f) = GP(f; μ, K), and let L be a linear functional. We will write l = L[f]. Just as Gaussian distributions are closed under linear transformations, so are Gaussian processes closed under the evaluation of linear functionals! The prior distribution for l is a Gaussian distribution:
$$p(l) = \mathcal{N}\bigl(l;\, L[\mu],\, L^2[K]\bigr),$$
where
$$L^2[K] = L\bigl[L[K(\cdot, x')]\bigr] = L\bigl[L[K(x, \cdot)]\bigr].$$
This result is essentially equivalent to the result for linear transformations of Gaussian-distributed vectors we have been using thus far, written with different notation. Notice also that if we consider the point-evaluation functional Lx, we recover a basic result:
$$p\bigl(f(x) \mid x\bigr) = \mathcal{N}\bigl(f(x);\, L_x[\mu],\, L_x^2[K]\bigr) = \mathcal{N}\bigl(f(x);\, \mu(x),\, K(x, x)\bigr).$$
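The same formula can be worked through for the differentiation functional. The sketch below is an illustration under stated assumptions rather than a result from the notes: it takes a one-dimensional input, a differentiable mean, and a squared exponential covariance K(z, z′) = σ² exp(−(z − z′)² / (2ℓ²)), where the output scale σ and length scale ℓ are symbols introduced here.

```latex
\begin{aligned}
L = D_x: \qquad L[\mu] &= \mu'(x), \\
L^2[K] &= \left.\frac{\partial^2 K(z, z')}{\partial z\, \partial z'}\right|_{z = z' = x}
        = \left.\sigma^2 \left(\frac{1}{\ell^2} - \frac{(z - z')^2}{\ell^4}\right)
          e^{-(z - z')^2 / (2\ell^2)}\right|_{z = z' = x}
        = \frac{\sigma^2}{\ell^2}, \\
p\bigl(f'(x)\bigr) &= \mathcal{N}\bigl(f'(x);\, \mu'(x),\, \sigma^2 / \ell^2\bigr).
\end{aligned}
```

Under these assumptions the prior variance of the derivative is the same at every x and grows as the length scale shrinks, which matches the intuition that shorter length scales give more rapidly varying functions.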
Considering the integration functional, we obtain a perhaps more-interesting result:
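$$p\Bigl(\int f(x)\, p(x)\, \mathrm{d}x\Bigr) = \mathcal{N}\Bigl(\int f(x)\, p(x)\, \mathrm{d}x;\ \int \mu(x)\, p(x)\, \mathrm{d}x,\ \iint K(x, x')\, p(x)\, p(x')\, \mathrm{d}x\, \mathrm{d}x'\Bigr).$$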
Therefore a Gaussian process distribution on f implies a Gaussian distribution on its integral against an arbitrary function p(x)! Further, the problem of estimating the integral of the (perhaps quite complicated) function f has been reduced to the perhaps-simpler problem of integrating the mean and covariance functions μ and K. This is the main idea behind Bayesian quadrature, also called Bayesian Monte Carlo.
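To make this reduction concrete, here is a minimal sketch that computes the prior distribution of the integral of f over [0, 10]. It assumes a zero-mean prior, a squared exponential covariance with unit output scale and unit length scale, and p(x) = 1 on [0, 10]; these choices and all function names are assumptions made for illustration. The double integral of this particular K has a closed form in terms of the error function, which the code compares against a plain trapezoid-rule approximation.

```python
import math
import numpy as np

def se_kernel(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential covariance K(x, x') evaluated on all pairs."""
    d = np.subtract.outer(np.asarray(x, float), np.asarray(xp, float))
    return sigma ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def trapezoid_rule(a, b, n):
    """Return nodes and weights of the trapezoid rule on [a, b]."""
    xs = np.linspace(a, b, n)
    w = np.full(n, (b - a) / (n - 1))
    w[[0, -1]] *= 0.5
    return xs, w

def double_integral_se(a, b, sigma=1.0, ell=1.0):
    """Closed form of the double integral of the SE kernel over [a, b] x [a, b]."""
    T = b - a
    return sigma ** 2 * (
        math.sqrt(2.0 * math.pi) * ell * T * math.erf(T / (math.sqrt(2.0) * ell))
        + 2.0 * ell ** 2 * (math.exp(-T ** 2 / (2.0 * ell ** 2)) - 1.0)
    )

a, b = 0.0, 10.0
xs, w = trapezoid_rule(a, b, 1001)

# Prior mean of the integral: L[mu] with mu = 0.
prior_mean = float(w @ np.zeros_like(xs))

# Prior variance of the integral: L^2[K], numerically and in closed form.
var_numeric = float(w @ se_kernel(xs, xs) @ w)
var_exact = double_integral_se(a, b)

print(f"prior mean of the integral: {prior_mean}")
print(f"prior variance (trapezoid):  {var_numeric:.4f}")
print(f"prior variance (closed form): {var_exact:.4f}")
```

Both variance computations should agree closely; the only work required was integrating the mean and covariance functions, never f itself.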
Given an observation of L[f] = l, we may condition our prior on this observation in a manner equivalent to that used to derive the posterior distribution of f. Let X be an arbitrary set of input locations. As before, we write the joint distribution between l and f = f(X):
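$$p\left(\begin{bmatrix} \mathbf{f} \\ l \end{bmatrix} \,\middle|\, X\right) = \mathcal{N}\left(\begin{bmatrix} \mathbf{f} \\ l \end{bmatrix};\ \begin{bmatrix} \boldsymbol{\mu} \\ L[\mu] \end{bmatrix},\ \begin{bmatrix} \mathbf{K} & ? \\ ? & L^2[K] \end{bmatrix}\right),$$
where we have defined
$$\boldsymbol{\mu} = \mu(X), \qquad \mathbf{K} = K(X, X).$$
To fill in the missing entries (marked "?"), we need to know the covariance between l and the ith function value f_i = f(x_i). Here we can exploit the linearity of covariance:
$$\operatorname{cov}(f_i, l) = \operatorname{cov}\bigl(L_{x_i}[f],\, L[f]\bigr) = L_{x_i}\bigl[L[\operatorname{cov}(f, f)]\bigr] = L_{x_i}\bigl[L[K]\bigr] = L[K(x_i, \cdot)].$$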
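As an illustration of this cross-covariance, the sketch below evaluates L[K(x_i, ·)] when L is integration over [0, 10] and K is a squared exponential covariance with unit output scale and length scale. The kernel, the interval, the input locations, and the function names are all assumptions chosen for the example. For this kernel the integral has a closed form involving the error function, which is checked against a trapezoid-rule approximation.

```python
import math
import numpy as np

def se_kernel(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential covariance K(x, x') evaluated on all pairs."""
    d = np.subtract.outer(np.asarray(x, float), np.asarray(xp, float))
    return sigma ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def integrated_se_kernel(x, a, b, sigma=1.0, ell=1.0):
    """Closed form of L[K(x, .)] = integral over [a, b] of K(x, x') dx'."""
    x = np.asarray(x, dtype=float)
    s = math.sqrt(2.0) * ell
    erf = np.vectorize(math.erf)
    scale = sigma ** 2 * ell * math.sqrt(math.pi / 2.0)
    return scale * (erf((b - x) / s) - erf((a - x) / s))

a, b = 0.0, 10.0
X = np.array([0.0, 2.5, 5.0, 9.0])  # arbitrary input locations

# cov(f(x_i), l) from the closed form.
cross_cov_exact = integrated_se_kernel(X, a, b)

# The same quantity by applying a trapezoid rule to K(x_i, .).
grid = np.linspace(a, b, 2001)
w = np.full(grid.size, (b - a) / (grid.size - 1))
w[[0, -1]] *= 0.5
cross_cov_numeric = se_kernel(X, grid) @ w

print(np.round(cross_cov_exact, 4))
print(np.round(cross_cov_numeric, 4))
```

Points near the middle of the interval covary most strongly with the integral, while a point on the boundary covaries about half as much, since only one side of its kernel mass lies inside [0, 10].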
Now we have the general result
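$$p\left(\begin{bmatrix} \mathbf{f} \\ l \end{bmatrix} \,\middle|\, X\right) = \mathcal{N}\left(\begin{bmatrix} \mathbf{f} \\ l \end{bmatrix};\ \begin{bmatrix} \boldsymbol{\mu} \\ L[\mu] \end{bmatrix},\ \begin{bmatrix} \mathbf{K} & L[K(X, \cdot)] \\ L[K(\cdot, X)] & L^2[K] \end{bmatrix}\right).$$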
Finally, we may condition this joint distribution on the observed value l = L[f ] to find the posterior of f , which will be an updated multivariate Gaussian distribution. Because the set of points X was
arbitrary, we may conclude that the posterior distribution is also a Gaussian process. The posterior mean and covariance functions are
$$\mu_{f \mid l}(x) = \mu(x) + \frac{L[K(x, \cdot)]}{L^2[K]}\, \bigl(l - L[\mu]\bigr);$$
$$K_{f \mid l}(x, x') = K(x, x') - \frac{L[K(x, \cdot)]\, L[K(\cdot, x')]}{L^2[K]}.$$
We can easily extend this result to include multiple observations of functionals and also to incorporate Gaussian noise on each of these observations.
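One way this extension might be written is sketched below. The notation is an assumption introduced here, not taken from the notes: L_1, …, L_m are linear functionals, their values are observed as v = l + ε with independent noise ε ~ N(0, Σ), and m, k(x), and C collect the corresponding means and covariances. The result follows from the same Gaussian conditioning used for the single-functional case.

```latex
\begin{aligned}
\mathbf{l} &= \bigl(L_1[f], \dots, L_m[f]\bigr)^\top, \qquad
\mathbf{v} = \mathbf{l} + \boldsymbol{\varepsilon}, \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \\
m_j &= L_j[\mu], \qquad
k_j(x) = L_j[K(x, \cdot)], \qquad
C_{jk} = L_j\bigl[L_k[K]\bigr], \\
\mu_{f \mid \mathbf{v}}(x) &= \mu(x) + \mathbf{k}(x)^\top (\mathbf{C} + \boldsymbol{\Sigma})^{-1} (\mathbf{v} - \mathbf{m}), \\
K_{f \mid \mathbf{v}}(x, x') &= K(x, x') - \mathbf{k}(x)^\top (\mathbf{C} + \boldsymbol{\Sigma})^{-1} \mathbf{k}(x').
\end{aligned}
```

With a single functional (m = 1) and no noise (Σ = 0), these expressions reduce to the posterior mean and covariance given above. Taking every L_j to be a point-evaluation functional recovers ordinary GP regression.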
An example is shown in Figure 1, where we condition a Gaussian process prior on the integral observation $\int_0^{10} f(x)\, \mathrm{d}x = 5$. Notice that the posterior samples all have integral exactly equal to 5.
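A setup like this one can be reproduced numerically with a few lines. The sketch below is one possible discretization, not the code used for the figure: it assumes a zero-mean prior, a squared exponential covariance with unit output scale and length scale, and it approximates the integration functional by trapezoid weights w on a grid, so that L[f] ≈ wᵀf, L[K(x, ·)] ≈ Kw, and L²[K] ≈ wᵀKw. All variable names are hypothetical.

```python
import numpy as np

def se_kernel(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential covariance K(x, x') evaluated on all pairs."""
    d = np.subtract.outer(np.asarray(x, float), np.asarray(xp, float))
    return sigma ** 2 * np.exp(-0.5 * (d / ell) ** 2)

a, b, l_obs = 0.0, 10.0, 5.0
n = 201
xs = np.linspace(a, b, n)

# Trapezoid weights: the discretized integration functional is f -> w @ f.
w = np.full(n, (b - a) / (n - 1))
w[[0, -1]] *= 0.5

mu = np.zeros(n)          # prior mean on the grid
K = se_kernel(xs, xs)     # prior covariance on the grid
Lmu = float(w @ mu)       # L[mu]
LK = K @ w                # L[K(x, .)] at every grid point
L2K = float(w @ K @ w)    # L^2[K]

# Posterior mean and covariance from the formulas above.
post_mean = mu + LK * (l_obs - Lmu) / L2K
post_cov = K - np.outer(LK, LK) / L2K

# Draw posterior samples; an eigendecomposition is used because the
# posterior covariance is singular (the integral is known exactly).
evals, evecs = np.linalg.eigh(post_cov)
root = evecs * np.sqrt(np.clip(evals, 0.0, None))
rng = np.random.default_rng(0)
samples = post_mean + rng.standard_normal((5, n)) @ root.T

mean_integral = float(w @ post_mean)
sample_integrals = samples @ w
print(f"integral of posterior mean: {mean_integral:.6f}")
print("integrals of posterior samples:", np.round(sample_integrals, 6))
```

Because the same weights define both the observation and the check, the posterior mean and every sample integrate to 5 up to rounding, mirroring the behavior described for Figure 1.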
Bayesian Quadrature
Above, we conditioned a Gaussian process on an integral observation. In Bayesian quadrature, we do the opposite: given (potentially noisy) observations of a function D = (X, y), we perform inference about an integral of interest, for example the expectation of f under a distribution p:
$$I_p[f] = \int f(x)\, p(x)\, \mathrm{d}x.$$
The traditional method for estimating integrals of this form is Monte Carlo estimation, where we sample some points $\{x_i\}_{i=1}^{N}$ from the distribution p(x) and estimate
$$\int f(x)\, p(x)\, \mathrm{d}x \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i).$$
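For reference, a minimal sketch of the Monte Carlo estimator follows. The integrand f(x) = x², the choice of p as a standard normal (for which the true expectation is 1), and the function name `monte_carlo_expectation` are assumptions made for this example.

```python
import numpy as np

def monte_carlo_expectation(f, sampler, n, rng):
    """Estimate the expectation of f under p from n samples drawn by `sampler`."""
    xs = sampler(rng, n)
    values = f(xs)
    estimate = values.mean()
    # The standard error shrinks like 1 / sqrt(n).
    std_error = values.std(ddof=1) / np.sqrt(n)
    return float(estimate), float(std_error)

rng = np.random.default_rng(0)
f = lambda x: x ** 2
standard_normal = lambda rng, n: rng.standard_normal(n)

estimate, std_error = monte_carlo_expectation(f, standard_normal, 100_000, rng)
print(f"estimate = {estimate:.4f} +/- {std_error:.4f} (true value 1)")
```

Note that the estimator uses only the function values at random locations and makes no use of any structure in f.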
In Bayesian quadrature, we place a Gaussian process prior on f, which we condition on the observations D. Notice that the input locations X do not need to be random samples from p, but rather we are allowed to evaluate f anywhere. The result is the posterior
$$p(f \mid \mathcal{D}) = \mathcal{GP}\bigl(f;\, \mu_{f \mid \mathcal{D}},\, K_{f \mid \mathcal{D}}\bigr).$$
Following the above, we may also derive the posterior distribution of the expectation Ip[f] :
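$$p\bigl(I_p[f] \mid \mathcal{D}\bigr) = \mathcal{N}\Bigl(I_p[f];\ \int \mu_{f \mid \mathcal{D}}(x)\, p(x)\, \mathrm{d}x,\ \iint K_{f \mid \mathcal{D}}(x, x')\, p(x)\, p(x')\, \mathrm{d}x\, \mathrm{d}x'\Bigr).$$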
For some choices of the prior mean and covariance functions μ and K and the distribution p, we may compute the required integrals exactly, giving a closed-form expression for the posterior distribution of the integral of interest.
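One such closed-form case is sketched below, under assumptions that are not from the notes: a zero prior mean, a squared exponential covariance with unit output scale and length scale, and p uniform on [a, b]. The integrals of K against p then reduce to error functions. The test integrand sin(x), the evaluation grid, the small noise variance, and all function names are hypothetical choices for illustration.

```python
import math
import numpy as np

def se_kernel(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential covariance K(x, x') evaluated on all pairs."""
    d = np.subtract.outer(np.asarray(x, float), np.asarray(xp, float))
    return sigma ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def kernel_mean_uniform(x, a, b, sigma=1.0, ell=1.0):
    """z(x) = integral of K(x, x') p(x') dx' for p uniform on [a, b]."""
    x = np.asarray(x, dtype=float)
    s = math.sqrt(2.0) * ell
    erf = np.vectorize(math.erf)
    scale = sigma ** 2 * ell * math.sqrt(math.pi / 2.0) / (b - a)
    return scale * (erf((b - x) / s) - erf((a - x) / s))

def kernel_double_mean_uniform(a, b, sigma=1.0, ell=1.0):
    """Double integral of K(x, x') p(x) p(x') for p uniform on [a, b]."""
    T = b - a
    total = (math.sqrt(2.0 * math.pi) * ell * T * math.erf(T / (math.sqrt(2.0) * ell))
             + 2.0 * ell ** 2 * (math.exp(-T ** 2 / (2.0 * ell ** 2)) - 1.0))
    return sigma ** 2 * total / T ** 2

def bayesian_quadrature(X, y, a, b, noise_var=1e-8):
    """Posterior mean and variance of I_p[f] given D = (X, y), zero prior mean."""
    K = se_kernel(X, X) + noise_var * np.eye(len(X))
    z = kernel_mean_uniform(X, a, b)
    mean = float(z @ np.linalg.solve(K, y))
    var = kernel_double_mean_uniform(a, b) - float(z @ np.linalg.solve(K, z))
    return mean, var

a, b = 0.0, 10.0
X = np.linspace(a, b, 15)   # evaluation locations; they need not be samples from p
y = np.sin(X)

bq_mean, bq_var = bayesian_quadrature(X, y, a, b)
prior_var = kernel_double_mean_uniform(a, b)
truth = (1.0 - math.cos(10.0)) / 10.0  # exact expectation of sin under p

print(f"BQ estimate = {bq_mean:.4f}, posterior variance = {bq_var:.2e}")
print(f"true value  = {truth:.4f}, prior variance = {prior_var:.4f}")
```

The evaluation locations here are a deterministic grid rather than random draws, which is allowed because the GP model, not the sampling distribution, carries the information from the observations to the integral.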
Why is this useful? The main advantages to this approach are that we may explicitly model the structure of f via the covariance function K, and that the posterior variance of the integral may be used to derive an active sampling scheme, revealing the most-informative points to evaluate the function so as to estimate the integral with the highest precision. Note that the posterior variance of the integral only depends on where we sample the function, and not the actual values we observe. This property can be exploited to design optimal quadrature rules.
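The independence of the posterior variance from the observed values can be seen directly in code: the variance function below takes only the input locations. The sketch then uses it in a simple greedy design loop, which is one possible heuristic and not necessarily optimal. It assumes a squared exponential covariance with unit output scale and length scale and p uniform on [0, 10]; the function names and the candidate grid are hypothetical.

```python
import math
import numpy as np

def se_kernel(x, xp, sigma=1.0, ell=1.0):
    """Squared exponential covariance K(x, x') evaluated on all pairs."""
    d = np.subtract.outer(np.asarray(x, float), np.asarray(xp, float))
    return sigma ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def kernel_mean_uniform(x, a, b, ell=1.0):
    """z(x) = integral of K(x, x') p(x') dx' for p uniform on [a, b]."""
    x = np.asarray(x, dtype=float)
    s = math.sqrt(2.0) * ell
    erf = np.vectorize(math.erf)
    return ell * math.sqrt(math.pi / 2.0) / (b - a) * (erf((b - x) / s) - erf((a - x) / s))

def kernel_double_mean_uniform(a, b, ell=1.0):
    """Double integral of K(x, x') p(x) p(x') for p uniform on [a, b]."""
    T = b - a
    total = (math.sqrt(2.0 * math.pi) * ell * T * math.erf(T / (math.sqrt(2.0) * ell))
             + 2.0 * ell ** 2 * (math.exp(-T ** 2 / (2.0 * ell ** 2)) - 1.0))
    return total / T ** 2

def integral_posterior_variance(X, a, b, noise_var=1e-8):
    """Posterior variance of the integral; no observed values y are needed."""
    X = np.asarray(X, dtype=float)
    K = se_kernel(X, X) + noise_var * np.eye(X.size)
    z = kernel_mean_uniform(X, a, b)
    return kernel_double_mean_uniform(a, b) - float(z @ np.linalg.solve(K, z))

def greedy_design(candidates, n_points, a, b):
    """Add, one at a time, the candidate that most reduces the integral variance."""
    chosen, variances = [], []
    for _ in range(n_points):
        remaining = [c for c in candidates if c not in chosen]
        scores = [integral_posterior_variance(chosen + [c], a, b) for c in remaining]
        best = int(np.argmin(scores))
        chosen.append(remaining[best])
        variances.append(scores[best])
    return chosen, variances

a, b = 0.0, 10.0
candidates = [float(c) for c in np.linspace(a, b, 101)]
design, variances = greedy_design(candidates, 5, a, b)
prior_var = kernel_double_mean_uniform(a, b)

print(f"prior variance of the integral: {prior_var:.4f}")
for x, v in zip(design, variances):
    print(f"add x = {x:5.2f} -> posterior variance {v:.4f}")
```

Since the whole design can be computed before any function value is observed, the evaluation locations can be planned in advance, which is the sense in which this property supports the design of quadrature rules.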
[Figure 1: two stacked panels over x from 0 to 10, each showing μ(x), a ±2σ band, and samples.]
Figure 1: Above: a Gaussian process prior on a function f with mean zero and squared exponential covariance. Below: the posterior distribution on f after conditioning on the observation $\int_0^{10} f(x)\, \mathrm{d}x = 5$. The posterior samples all have integral identically equal to 5.