Bayesian Quadrature: Model-based Approximate Integration
David Duvenaud, University of Cambridge
The Quadrature Problem
We want to estimate an integral
Z = \int f(x)\, p(x)\, dx
Most computational problems in inference correspond to integrals:
Expectations
Marginal distributions
Integrating out nuisance parameters
Normalization constants
Model comparison
[Figure: the integrand f(x) and the input density p(x)]
Sampling Methods
Monte Carlo methods: sample from p(x), take the empirical mean:
\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)
Possibly sub-optimal for two reasons:
Random bunching up
Often, nearby function values will be similar
Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence.
[Figure: the function f(x), the input density p(x), and Monte Carlo samples]
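As a concrete reference point, here is a minimal Monte Carlo sketch. The density p(x) = N(0, 1), the integrand f(x) = sin(x)^2, and the sample size are illustrative choices of mine, not examples from the talk.

# Minimal Monte Carlo sketch, assuming p(x) = N(0, 1) and f(x) = sin(x)^2
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) ** 2
x = rng.standard_normal(1000)   # draw x_i from p(x)
Z_hat = f(x).mean()             # empirical mean estimate of Z = int f(x) p(x) dx
print(Z_hat)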
Model-based Integration
Place a prior on f, for example a GP.
The posterior over f implies a posterior over Z.
[Figure: GP mean and ±2 SD bands on f(x), the input density p(x), the observed samples, and the implied posterior p(Z)]
We'll call this use of a GP prior Bayesian Quadrature.
Bayesian Quadrature Estimator
The posterior over Z has a mean that is linear in the observed values f(\mathbf{x}_s):
\mathbb{E}_{\mathrm{GP}}[Z \mid f(\mathbf{x}_s)] = \mathbf{z}^\top K^{-1} f(\mathbf{x}_s), where z_n = \int k(x, x_n)\, p(x)\, dx
Natural to minimize the posterior variance of Z:
\mathbb{V}[Z \mid f(\mathbf{x}_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' - \mathbf{z}^\top K^{-1} \mathbf{z}
The variance doesn't depend on the function values at all!
Choosing samples sequentially to minimize variance: Sequential Bayesian Quadrature.
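A minimal one-dimensional sketch of these formulas, assuming a squared-exponential kernel and a Gaussian p(x) so that z_n and the prior double integral have closed forms. The kernel, density, lengthscale, and toy integrand are all illustrative choices of mine, not the talk's.

# Bayesian Quadrature in 1-D, assuming k(x, x') = exp(-(x - x')^2 / (2 l^2))
# and p(x) = N(m, s^2), for which z_n and the double integral are analytic.
import numpy as np

def bq_posterior(xs, fs, l=0.5, m=0.0, s=1.0, jitter=1e-10):
    xs, fs = np.asarray(xs, float), np.asarray(fs, float)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / l ** 2)
    K += jitter * np.eye(len(xs))                      # numerical stability
    # z_n = int k(x, x_n) p(x) dx, closed form for SE kernel and Gaussian p
    z = l / np.sqrt(l**2 + s**2) * np.exp(-0.5 * (xs - m)**2 / (l**2 + s**2))
    mean = z @ np.linalg.solve(K, fs)                   # E[Z | f(xs)] = z^T K^{-1} f(xs)
    prior = l / np.sqrt(l**2 + 2 * s**2)                # int int k(x,x') p(x) p(x') dx dx'
    var = prior - z @ np.linalg.solve(K, z)             # V[Z | f(xs)]
    return mean, var

def sbq_next(xs, candidates, **kw):
    # Sequential BQ: the variance ignores the function values, so the next
    # sample location can be chosen by minimizing it over a candidate set.
    return min(candidates,
               key=lambda c: bq_posterior(np.append(xs, c), np.zeros(len(xs) + 1), **kw)[1])

xs = np.linspace(-3, 3, 9)
mean, var = bq_posterior(xs, np.sin(xs) ** 2)           # toy integrand f(x) = sin(x)^2
print(mean, np.sqrt(var))
print(sbq_next(xs, np.linspace(-4, 4, 81)))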
Things you can do with Bayesian Quadrature
Can incorporate knowledge of the function, such as symmetries:
f(x, y) = f(y, x) \;\Leftrightarrow\; k_s(x, y, x', y') = k(x, y, x', y') + k(x, y', x', y) + k(x', y, x, y') + k(x', y', x, y)
Can condition on gradients.
The posterior variance is a natural convergence diagnostic.
[Figure: posterior over Z, mean with error bars, shrinking as the number of samples grows from 100 to 200]
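A small sketch of the symmetrization idea. This is my own illustration with an SE base kernel on R^2 and one standard way of summing over argument swaps; it shows the construction rather than transcribing the exact kernel on the slide.

# One standard way (my illustration, not necessarily the slide's exact kernel)
# to encode f(x, y) = f(y, x): sum a base kernel over the argument swap applied
# to each input pair. A GP with this covariance has symmetric samples.
import numpy as np

def k_base(a, b, l=1.0):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.exp(-0.5 * np.sum((a - b) ** 2) / l ** 2)   # SE kernel on R^2

def k_sym(x, y, xp, yp):
    return (k_base((x, y), (xp, yp)) + k_base((x, y), (yp, xp))
            + k_base((y, x), (xp, yp)) + k_base((y, x), (yp, xp)))

# Swapping (x, y) leaves the covariance unchanged:
print(k_sym(0.3, 1.2, -0.5, 0.7), k_sym(1.2, 0.3, -0.5, 0.7))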
More things you can do with Bayesian Quadrature
Can compute the likelihood of the GP, and so learn the kernel.
Can compute marginals with error bars, in two ways:
Simply from the GP posterior,
Or by recomputing f_θ(x) for different θ with the same x.
Much nicer than histograms!
[Figure: marginal likelihood as a function of σ_x, with error bars on Z and the true σ_x marked]
Rates of Convergence
What is the rate of convergence of SBQ when its assumptions are true?
[Figure: expected variance / MMD versus number of samples, comparing empirical rates inside and outside the RKHS with a bound on the Bayesian error]
GPs vs Log-GPs for Inference
[Figure: a non-negative likelihood l(x) with its evaluations, GP posterior, and log-GP posterior; below, the true log-function log l(x) with its evaluations and GP posterior]
GPs vs Log-GPs
[Figure: a second example likelihood l(x) with its evaluations, GP posterior, and log-GP posterior; below, log l(x) with its evaluations and GP posterior]
Integrating under Log-GPs
[Figure: the true function and evaluations with the mean of a GP approximation to the log-GP posterior and its inducing points; below, log l(x) with its evaluations, GP posterior, and inducing points]
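A minimal sketch of the log-GP idea, under assumptions of my own: fit an ordinary GP to observations of log l(x), then use the pointwise lognormal mean exp(μ(x) + σ²(x)/2) as the implied, always-positive model of l(x). The inducing-point approximation used for the actual integration is not reproduced here.

# Sketch: a GP on log l(x) keeps the implied model of l(x) positive.
# The toy log-likelihood, kernel, and lengthscale are illustrative choices.
import numpy as np

def gp_regression(xs, ys, xq, l=0.7, jitter=1e-8):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l ** 2)
    K = k(xs, xs) + jitter * np.eye(len(xs))
    Ks = k(xq, xs)
    mu = Ks @ np.linalg.solve(K, ys)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)  # diag of posterior cov
    return mu, var

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
log_ls = -0.5 * xs ** 2                     # toy log-likelihood with a single bump
xq = np.linspace(-3.0, 3.0, 61)
mu, var = gp_regression(xs, log_ls, xq)     # GP fit on the log scale
l_mean = np.exp(mu + 0.5 * var)             # implied (positive) mean of l(x)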
Conclusions
Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and allows us to check our assumptions.
BQ has nice convergence properties if its assumptions are correct.
For inference, a GP is not especially appropriate, but other models are intractable.
Limitations and Future Directions
Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and is only worth using when computing f(x) is expensive.
How to extend to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
It seems unlikely that we'll find another tractable nonparametric distribution like GPs. Should we accept that we'll need a second round of approximate integration on a surrogate model?
How much overhead is worthwhile? Work on bounded rationality seems relevant.
Thanks!