Gaussian Process Classification
Just as we could use the kernel trick to extend Bayesian linear regression to Gaussian processes for general-purpose nonlinear regression, we may also extend Bayesian linear classification in the same way. In Gaussian process classification, we assume there is a latent function f : X → R that is commensurate with the probability of a positive observation; higher latent function values correspond to higher probabilities of positive observations. In Bayesian linear classification, we assumed a parametric (linear) form for this latent function:
f(x) = x⊤w.
In Gaussian process classification, rather than choosing a parametric form for f , we instead place a
Gaussian process prior on f:
Note that a Gaussian prior on the weight vector w above induces a Gaussian process prior on f
with mean function
and covariance function
μ(x) = x⊤μ K(x, x′) = x⊤Σx,
p(f) = GP(f; μ, K).
where p(w) = N (w; μ, Σ). The Gaussian process formalism allows us to model arbitrary nonlinear classification boundaries by using any desired mean and covariance function for f.
Likelihood
Suppose we have made binary observations at a set of values X, and define f = f(X) to be the associated set of latent function values. As in Bayesian linear classification, we assume the following likelihood for a given binary observation yi associated with xi:
p(yi = 1 | fi) = σ(fi),
where σ : R → (0, 1) is a monotonically increasing sigmoid function such as the logisitic function or the standard normal cdf. We again assume the observations are conditionally independent given the latent function values:
p(y|f)=p(yi |fi).
Inference
Given our prior p(f) = GP(f; μ, K) and a set of observations D = (X, y), we wish to find the posterior distribution of the latent function values f = f(X). Note that if we had a Gaussian posterior f , this would induce a Gaussian process posterior for the function f given D. The posterior is
p(f | D) = 1 p(f | X)p(y | f) = Nf;μ(X),K(X,X)p(yi | fi).
Z
i
Unfortunately, the sigmoid likelihood coupled with the Gaussian prior do not couple to form a tractable posterior. Instead, we must approximate this posterior in some way. Previously we described the Laplace approximation, which approximates the unnormalized log posterior with a second-order Taylor expansion, resulting in a Gaussian approximate posterior centered at the posterior mode
ˆf = argmaxp(f | D). f
1
Here we will consider two more general-purpose approximation techniques for approximating intractable posterior distributions. These techniques are useful when the likelihood factorizes into one-dimensional terms, as in gp classification.
Assumed Density Filtering
Consider a posterior distribution of the form
1 N
p(θ | D) = Z p0(θ)
where the ti are typically likelihood terms, for example our p(yi | fi) above. We assume the prior p0(θ) has been chosen to be some nice form, for example a Gaussian, and we will use the Gaussian case to illustrate the idea below.
In assumed density filtering (adf), we assume that the posterior has the same form as the prior p0, and we seek an approximating distribution
q(θ) ≈ p(θ | D)
from the same family as p0 that approximates the true posterior “as well as possible.”
Mechanically, ensuring that our approximate distribution q remains in the same family as p0 is achieved by selecting an (unnormalized) member t ̃ from the likelihood conjugate to the prior for
each of the likelihood terms ti. The result is
q ( θ ) = p ( θ ) Z ̃ t ̃ ( θ ; θ ̃ ) ,
i
N 0iii
i=1
and this product will belong to the desired family by conjugacy. Here the constants Z ̃i and the local
parameter vectors {θ ̃i} are free parameters, called site parameters, we may choose for each of the
approximating distributions t ̃ to try to improve the fit of the approximating distribution. i
For example, the Gaussian distribution is self-conjugate, so the adf approximation will in this case take the form
NN
q(θ) = p0(θ)ti(θ) ≈ N(θ;μ,Σ)Z ̃iN(θ;μ ̃i,Σ ̃i),
i=1 i=1
where the site parameters are the constants {Z ̃i} (chosen so that the approximation normalizes) as
well as the local mean vectors {μ ̃i} and covariance matrices Σ ̃i.
Note that in the gp classification case, the likelihood terms in the product are in fact one dimensional,
because the likelihood for yi only depends on fi:
N
q(f) = N(f;μ,Σ)Z ̃iN(fi;μ ̃i,σ ̃i2).
i=1 Consider just the first two terms in this product:
p(θ | D1) ∝ p0(θ)t1(θ). 2
i=1
ti(θ),
This product will not have the same nice form of the prior, but has only been warped “slightly” away from the prior via a single likelihood term. In many cases, this distribution will be at least partially manageable. Perhaps there will not be a nice closed expression, but we might still be able to compute the normalizing constant or the moments of the posterior.
In assumed density filtering, we will approximate this product with a member of the desired family:
q (θ)=Z ̃ p (θ)t ̃(θ;θ ̃ )≈p(θ|D )∝p (θ)t (θ). 11011101
This approximation is done by matching the moments between the approximation q1(θ) and the true posterior p(θ | D1).
For example, consider p0(θ) = N(θ;μ,Σ). We select the site parameters Z ̃−1 μ ̃ Σ ̃
such that
covq1(θ) = covp(θ | D1).
Now the approximate distribution is in the desired density family (Gaussians) and matches the true
posterior up to the second moment.
Now consider the first three terms of the product:
p(θ | D1, D2) ∝ p0(θ)t1(θ)t2(θ) ∝ p(θ | D1)t2(θ).
The idea in assumed density filtering is to substitute in our approximation q1(θ), giving:
p(θ | D1, D2) ∼∝ q1(θ)t2(θ).
Now we are in the same situation we were in before! We have a nice “prior” distribution q1 multiplied
by a single likelihood term t2. We proceed as before, replacing the true likelihood term t2 with an
approximate (conjugate) term Z ̃ t ̃ (θ; θ ̃ ), where we again choose the site parameters (Z ̃ , θ ̃ ) to 222 22
match the moments between our new approximation
q (θ)=Z ̃ q (θ)t ̃(θ;θ ̃ )=Z ̃ Z ̃ p (θ)t ̃(θ;θ ̃ )t ̃(θ;θ ̃ ). 2 21221201122
and the “less-approximate” posterior q1(θ)t2(θ). We proceed in this fashion until we have processed all the local likelihood terms, resulting in the final approximation
Expectation Propagation
111
q1(θ)dθ = 1
Eq1(θ) = Ep(θ | D1)
N
q ( θ ) = p ( θ ) Z ̃ t ̃ ( θ ; θ ̃ ) . 0iii
i=1
One thing to note about assumed density filtering is that our final approximation is dependent on the sequence in which we process the likelihood terms {ti}. Note that we are always matching the moments between
qi−1(θ)ti(θ), 3
the approximation using data up to the ith term as well as the true ith likelihood term, and the new approximation
q(θ)=Z ̃q (θ)t ̃(θ;θ ̃). i ii−1ii
This moment matching at time i therefore never considers data that appears in future terms. For this reason, we might accumulate errors and/or wish to later “revisit” a particular term and update the site parameters (Z ̃i, θ ̃i) in light of future data. This idea leads to expectation propagation, a refinement of assumed density filtering that can address some of these issues.
The idea is simple. Once we have processed each of the likelihood terms, resulting in the approxi- mation above
N
q ( θ ) = p ( θ ) Z ̃ t ̃ ( θ ; θ ̃ ) , 0iii
i=1
we repeatedly revisit each term and update its site parameters. The mechanism for doing so is quite simple. First, we select a site 1 ≤ i ≤ N to update, then form the so-called cavity distribution, which is our approximation using all but the ith term in the product:
q (θ)=p(θ)Z ̃t ̃(θ;θ ̃), −i 0 jj j
j ̸=i
where we have divided by the old site term Z ̃ t ̃ (θ; θ ̃ ). Now we replace the removed site term with
iii
the true likelihood term ti, forming the tilted distribution
q−i(θ)ti(θ).
Finally, we select new site parameters to (re)match the moments between the tilted distribution and
our new approximation:
q (θ)=Z ̃q (θ)t ̃(θ;θ ̃). new i−iii
We proceed continually updating site parameters in this manner until we reach convergence (i.e., none of the site parameters changes very much) or we expend a chosen computational budget.
Theoretical motivation
We conclude with one brief note about the theoretical motivation behind the moment matching used in assumed density filtering and expectation propagation. The Kullback–Leibler (kl) divergence (also called relative entropy) is a notion of “distance” between probability distributions, defined by
p(θ)
dkl p(θ) ∥ q(θ) = p(θ) log q(θ) dθ.
Notice that kl divergence is not a true distance, as it is not symmetric, but it does satisfy dkl(p ∥ q) ≥ 0 with dkl(p ∥ q) = 0 if and only if p(θ) = q(θ) almost everywhere.
A well-known result is that the kl divergence between an arbitrary probability distribution p(θ) and a multivariate Gaussian distribution q(θ) (in the direction dkl(p ∥ q)) is minimized when q is chosen to match the moments of p. Therefore, in the case of Gaussian approximations, these methods can be seen as iteratively building up an approximate posterior in the desired family by minimizing the kl divergence at every step.
Minimizing kl divergence in “the other direction,” dkl(q ∥ p), gives rise to another family of approximation techniques known as variational Bayesian inference.
4