Bayesian Optimization
Suppose we have a function f : X → R that we with to minimize on some domain X ⊆ X. That is, we wish to find
x∗ =argminf(x). x∈X
In numerical analysis, this problem is typically called (global) optimization and has been the subject of decades of study. We draw a distinction between global optimization, where we seek the absolute optimum in X, and local optimization, where we seek to find a local optimum in the neighborhood of a given initial point x0.
A common approach to optimization problems is to make some assumptions about f . For example, when the objective function f is known to be convex and the domain X is also convex, the problem is known as convex optimization and has been widely studied. Convex optimization is a common tool used across machine learning.
If an exact functional form for f is not available (that is, f behaves as a “black box”), what can we do? Bayesian optimization proceeds by maintaining a probabilistic belief about f and designing a so- called acquisition function to determine where to evaluate the function next. Bayesian optimization is particularly well-suited to global optimization problems where f is an expensive black-box function; for example, evaluating f might require running an expensive simulation. Bayesian optimization has recently become popular for training expensive machine-learning models whose behavior depend in a complicated way on their parameters (e.g., convolutional neural networks). This is an example of the “AutoML” paradigm.
Although not strictly required, Bayesian optimization almost always reasons about f by choosing an appropriate Gaussian process prior:
p(f) = GP(f; μ, K).
Given observations D = (X, f ),1 we can condition our distribution on D as usual:
p(f | D) = GP(f;μf|D,Kf|D).
Given this set of observations, how do we select where to observe the function next? The meta- approach in Bayesian optimization is to design an acquisition function a(x). The acquisition function is typically an inexpensive function that can be evaluated at a given point that is commensurate with how desirable evaluating f at x is expected to be for the minimization problem. We then optimize the acquisition function to select the location of the next observation. Of course, we have merely replaced our original optimization problem with another optimization problem, but on a much-cheaper function a(x).
Acquisition Functions
Many acquisition functions can be interpreted in the framework of Bayesian decision theory as evaluating an expected loss associated with evaluating f at a point x. We then select the point with the lowest expected loss, as usual.
In the below we drop the f | D subscripts on the mean μ and covariance K functions for f.
1
We will assume these observations to be noiseless here, but we could extend the methods here to the noisy case without difficulty.
1
Probability of improvement
Perhaps the first acquisition function designed for Bayesian optimization was probability of im- provement. Suppose
f′ =minf
is the minimal value of f observed so far. Probability of improvement evaluates f at the point most
2 likelytoimproveuponthisvalue.Thiscorrespondstothefollowingutilityfunction associated
with evaluating f at a given point x:
u(x) = 1 f(x) ≤ f′.
That is, we receive a unit reward if f(x) turns out to be less than f′, and no reward otherwise. The probability of improvement acquisition function is then the expected utility as a function of x:
The loss function associated with probability of improvement is somewhat odd: we get a reward for improving upon the current minimum independent of the size of the improvement! This can sometimes lead to odd behavior, and in practice can get stuck in local optima and underexplore globally.
An alternative acquisition function that does account for the size of the improvement is expected improvement. Again suppose that f ′ is the minimal value of f observed so far. Expected improvement evaluates f at the point that, in expectation, improves upon f′ the most. This corresponds to the following utility function:
u(x)=max0,f′ −f(x).
That is, we receive a reward equal to the “improvement” f ′ − f (x) if f (x) turns out to be less than f′, and no reward otherwise. The expected improvement acquisition function is then the expected utility as a function of x:
The expected improvement has two components. The first can be increased by reducing the mean function μ(x). The second can be increased by increasing the variance K(x, x). These two terms can be interpreted as explicitly encoding a tradeoff between exploitation (evaluating at points with low mean) and exploration (evaluating at points with high uncertainty). The exploitation– exploration tradeoff is a classic consideration in such problems, and the expected improvement criterion automatically captures both as a result of the Bayesian decision theoretic treatment.
2
This is the Bayes action under this loss.
Expected improvement
f′ −∞
0 f(x)>f′
api(x) = Eu(x) | x,D =
= Φf′; μ(x), K(x, x).
(f′ −f)Nf;μ(x),K(x,x)df
The point with the highest expected improvement (the maximal expected utility) is selected.
f′ −∞
Nf;μ(x),K(x,x)df
The point with the highest probability of improvement (the maximal expected utility) is selected.
aei(x)=Eu(x)|x,D=
=f′ −μ(x)Φf′;μ(x),K(x,x)+K(x,x)Nf′;μ(x),K(x,x).
Recall a utility function is simply a negative loss function.
2
Entropy search
A third alternative is entropy search. Here, we seek to minimize the uncertainty we have in the location of the optimal value
x∗ =argminf(x). x∈X
Notice that our belief over f induces a distribution over x∗, p(x∗ | D). Unfortunately, there is no closed-form expression for this distribution.
Entropy search seeks to evaluate points so as to minimize the entropy of the induced distribution p(x∗ | D). Here the utility function is the reduction in this entropy given a new measurement at x, x, f (x):
u(x)=H[x∗ |D]−Hx∗ |D,x,f(x).
As in probability of improvement and expected improvement, we may build an acquisition function by evaluating the expected utility provided by evaluating f at a point x. Due to the nature of the distribution p(x∗ | D), this is somewhat complicated, and a series of approximations must be made.
Upper confidence bound
A final alternative acquisition function is typically known as gp-ucb, where ucb stands for upper confidence bound. gp-ucb is typically described in terms of maximizing f rather than minimizing f; however in the context of minimization, the acquisition function would take the form
aucb(x; β) = μ(x) − βσ(x),
where β > 0 is a tradeoff parameter and σ(x) = K(x, x) is the marginal standard deviation of
f (x).3
Again, the gp-ucb acquisition function contains explicit exploitation (μ(x)) and exploration (σ(x)) terms. Interestingly, the acquisition function cannot be interpreted as computing a natural expected utility function. Nonetheless, strong theoretical results are known for gp-ucb, namely, that under certain conditions, the iterative application of this acquisition function will converge to the true global minimum of f.
3In the context of minimization, this is better described as a lower confidence bound, but ucb is ingrained in the literature as a standard term.
3