
Bayesian Quadrature: Model-based Approximate Integration
David Duvenaud, University of Cambridge

The Quadrature Problem
- We want to estimate an integral
  Z = \int f(x)\, p(x)\, dx
- Most computational problems in inference correspond to integrals:
  - Expectations
  - Marginal distributions
  - Integrating out nuisance parameters
  - Normalization constants
  - Model comparison
[Figure: an integrand f(x) and an input density p(x)]


Sampling Methods
- Monte Carlo methods: sample from p(x) and take the empirical mean:
  \hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)
- Possibly sub-optimal for two reasons:
  - Random bunching up of samples
  - Often, nearby function values will be similar
- Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence.
[Figure: the function f(x), the input density p(x), and Monte Carlo samples]
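A minimal sketch of the plain Monte Carlo estimator above, assuming p(x) = N(0, 1) and an arbitrary test integrand; the helper names are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample_p, n=10_000):
    """Plain Monte Carlo: Z-hat = (1/N) * sum_i f(x_i), with x_i ~ p(x)."""
    x = sample_p(n)
    return np.mean(f(x))

# Example: Z = ∫ sin(x)^2 N(x; 0, 1) dx ≈ 0.432
z_hat = mc_estimate(lambda x: np.sin(x)**2, lambda n: rng.normal(0.0, 1.0, n))
print(z_hat)
```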


Model-based Integration
- Place a prior on f, for example a GP.
- The posterior over f implies a posterior over Z.
- We'll call the approach using a GP prior Bayesian Quadrature.
[Figure: GP posterior mean ± 2 SD over f(x), the input density p(x), the samples, and the implied posterior p(Z)]
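A minimal sketch of how a posterior over f induces a posterior over Z: draw functions from a GP posterior on a grid and integrate each draw against p(x). The RBF kernel, the test data, and all names are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# Observed function values
xs = np.array([-2.0, -0.5, 0.3, 1.5])
fs = np.sin(xs)**2

# GP posterior over f on a dense grid
grid = np.linspace(-4, 4, 400)
K = rbf(xs, xs) + 1e-8 * np.eye(len(xs))
Ks = rbf(grid, xs)
mu = Ks @ np.linalg.solve(K, fs)
cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T) + 1e-8 * np.eye(len(grid))

# Each posterior draw of f gives one draw of Z = ∫ f(x) p(x) dx,
# approximated here by a Riemann sum against p(x) = N(0, 1).
p = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
dx = grid[1] - grid[0]
draws = rng.multivariate_normal(mu, cov, size=200)
Z_draws = (draws * p).sum(axis=1) * dx
print(Z_draws.mean(), Z_draws.std())   # implied posterior over Z
```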


Bayesian Quadrature Estimator
- The posterior over Z has a mean that is linear in f(\mathbf{x}_s):
  \mathbb{E}_{\mathrm{GP}}[Z \mid f(\mathbf{x}_s)] = \mathbf{z}^{\top} K^{-1} f(\mathbf{x}_s), \quad \text{where } z_n = \int k(x, x_n)\, p(x)\, dx
- Natural to minimize the posterior variance of Z:
  \mathbb{V}[Z \mid f(\mathbf{x}_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' - \mathbf{z}^{\top} K^{-1} \mathbf{z}
- The variance doesn't depend on the function values at all!
- Choosing samples sequentially to minimize the variance: Sequential Bayesian Quadrature.
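The estimator is tractable whenever the kernel integrals z_n are. A minimal sketch for the common closed-form case of an RBF kernel with a Gaussian p(x); the lengthscale, density parameters, and helper names are illustrative assumptions, not from the slides.

```python
import numpy as np

def bq_estimate(x, f, ell=0.5, s=1.0, jitter=1e-8):
    """Posterior mean and variance of Z = ∫ f(x) p(x) dx given samples f(x),
    assuming k(x, x') = exp(-(x - x')^2 / (2 ell^2)) and p(x) = N(0, s^2)."""
    x = np.asarray(x, dtype=float)
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    K += jitter * np.eye(len(x))                     # numerical stability
    # z_n = ∫ k(x, x_n) p(x) dx  (closed form for RBF kernel, Gaussian p)
    z = np.sqrt(ell**2 / (ell**2 + s**2)) * np.exp(-0.5 * x**2 / (ell**2 + s**2))
    # ∫∫ k(x, x') p(x) p(x') dx dx'
    kbar = np.sqrt(ell**2 / (ell**2 + 2 * s**2))
    Kinv_z = np.linalg.solve(K, z)
    mean = Kinv_z @ f                                # z^T K^{-1} f(x_s)
    var = kbar - z @ Kinv_z                          # independent of f!
    return mean, var

# Example: estimate ∫ sin(x)^2 N(x; 0, 1) dx  (true value ≈ 0.432)
xs = np.linspace(-3, 3, 15)
mean, var = bq_estimate(xs, np.sin(xs)**2)
print(mean, var)
```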


Things you can do with Bayesian Quadrature
- Can incorporate knowledge of the function, such as symmetries:
  f(x, y) = f(y, x) \;\Leftrightarrow\; k_s(x, y, x', y') = k(x, y, x', y') + k(x, y, y', x') + k(y, x, x', y') + k(y, x, y', x')
- Can condition on gradients.
- The posterior variance is a natural convergence diagnostic.
[Figure: posterior over Z with error bars shrinking as the number of samples grows from 100 to 200]
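A minimal sketch of the symmetrization above: build a symmetrized kernel from a base kernel by summing over argument permutations, so that GP draws respect f(x, y) = f(y, x). The RBF base kernel and all names are illustrative assumptions.

```python
import numpy as np

def k_base(x, y, xp, yp, ell=1.0):
    """Base RBF kernel between the input pairs (x, y) and (xp, yp)."""
    return np.exp(-((x - xp)**2 + (y - yp)**2) / (2 * ell**2))

def k_sym(x, y, xp, yp):
    """Symmetrized kernel: a GP with this covariance is the law of
    g(x, y) = f(x, y) + f(y, x), so its draws satisfy g(x, y) = g(y, x)."""
    return (k_base(x, y, xp, yp) + k_base(x, y, yp, xp)
            + k_base(y, x, xp, yp) + k_base(y, x, yp, xp))

# Sanity check: the kernel is invariant to swapping arguments within a pair.
print(np.isclose(k_sym(0.3, 1.2, -0.5, 0.8), k_sym(1.2, 0.3, -0.5, 0.8)))
```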


More things you can do with Bayesian Quadrature
- Can compute the likelihood of the GP and learn the kernel.
- Can compute marginals with error bars, in two ways:
  - Simply from the GP posterior
  - Or by recomputing f_\theta(x) for different \theta with the same x
- Much nicer than histograms!
[Figure: marginal of Z as a function of \sigma_x with error bars, alongside the true \sigma_x and the marginal likelihood]
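A minimal sketch of "compute the likelihood of the GP, learn the kernel": evaluate the GP log marginal likelihood for a few candidate lengthscales and keep the best. The RBF kernel, data, and helper names are illustrative assumptions.

```python
import numpy as np

def gp_log_marginal_likelihood(x, f, ell, noise=1e-6):
    """log N(f; 0, K) for an RBF kernel with lengthscale ell -- the quantity
    one would maximise to learn kernel hyperparameters from the samples."""
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    K += noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))
    return (-0.5 * f @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(x) * np.log(2 * np.pi))

x = np.linspace(-3, 3, 15)
f = np.sin(x)**2
best_ell = max([0.1, 0.3, 1.0, 3.0],
               key=lambda ell: gp_log_marginal_likelihood(x, f, ell))
print(best_ell)
```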


Rates of Convergence
What is the rate of convergence of SBQ when its assumptions are true?
[Figure: expected variance / MMD versus number of samples, showing empirical rates both in and out of the RKHS and a bound on the Bayesian error]
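Because the posterior variance does not depend on the observed function values, Sequential Bayesian Quadrature can pick each new sample by greedily minimizing it. A minimal sketch under the same illustrative RBF-kernel, Gaussian-p(x) assumptions as the earlier estimator sketch; all names are hypothetical.

```python
import numpy as np

def posterior_variance(x, ell=0.5, s=1.0, jitter=1e-8):
    """V[Z | f(x_s)] for an RBF kernel and p(x) = N(0, s^2); no f values needed."""
    x = np.asarray(x, dtype=float)
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2) + jitter * np.eye(len(x))
    z = np.sqrt(ell**2 / (ell**2 + s**2)) * np.exp(-0.5 * x**2 / (ell**2 + s**2))
    kbar = np.sqrt(ell**2 / (ell**2 + 2 * s**2))
    return kbar - z @ np.linalg.solve(K, z)

# Greedy SBQ: repeatedly add the candidate that most reduces the variance.
candidates = np.linspace(-3, 3, 121)
points = [0.0]                                   # start at the mode of p(x)
for _ in range(9):
    best = min(candidates, key=lambda c: posterior_variance(points + [c]))
    points.append(float(best))
print(np.round(points, 2), posterior_variance(points))
```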


GPs vs Log-GPs for Inference
[Figure, left: a non-negative likelihood l(x) with its evaluations, a GP posterior fit directly to l(x), and the posterior implied by a GP on log l(x) (the Log-GP posterior). Right: the true log-likelihood log l(x), its evaluations, and the GP posterior on log l(x).]


GPs vs Log-GPs
[Figure, left: a sharply peaked likelihood l(x) with its evaluations, the GP posterior fit directly to l(x), and the Log-GP posterior. Right: the true log-likelihood log l(x) with its evaluations and its GP posterior.]
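A minimal sketch of the contrast in these figures: fit a GP to the likelihood l(x) directly, or fit it to log l(x) and exponentiate the posterior mean so the model can never go negative. The test likelihood, kernel, and names are illustrative assumptions, not from the slides.

```python
import numpy as np

def gp_posterior_mean(x_train, y_train, x_test, ell=1.0, jitter=1e-8):
    """Posterior mean of a zero-mean GP with an RBF kernel."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(x_train, x_train) + jitter * np.eye(len(x_train))
    return k(x_test, x_train) @ np.linalg.solve(K, y_train)

# A sharply peaked, non-negative "likelihood"
l = lambda x: np.exp(-0.5 * (x / 0.3)**2) * 200.0

xs = np.linspace(-3, 3, 9)
grid = np.linspace(-3, 3, 200)

direct_fit = gp_posterior_mean(xs, l(xs), grid)                   # GP on l(x): can dip negative
log_gp_fit = np.exp(gp_posterior_mean(xs, np.log(l(xs)), grid))   # GP on log l(x): always positive

print(direct_fit.min(), log_gp_fit.min())
```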


Integrating under Log-GPs
[Figure, left: the true function with its evaluations, the mean of a GP approximation to the Log-GP, and the inducing points. Right: the true log-function with its evaluations, the GP posterior on the log, and the inducing points.]
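As a much simpler stand-in for the approximation sketched in this figure (it is not the inducing-point construction from the slides), one can push Log-GP posterior samples through exp and integrate each against p(x) on a grid; everything below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# Place the GP on log l(x), then push posterior samples through exp and
# integrate each draw against p(x) = N(0, 1) by a Riemann sum.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
log_l = np.log(np.exp(-0.5 * (xs / 0.5)**2) * 50.0 + 1e-12)

grid = np.linspace(-4, 4, 300)
K = rbf(xs, xs) + 1e-8 * np.eye(len(xs))
Ks = rbf(grid, xs)
mu = Ks @ np.linalg.solve(K, log_l)
cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T) + 1e-8 * np.eye(len(grid))

p = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
dx = grid[1] - grid[0]
samples = rng.multivariate_normal(mu, cov, size=200)
Z_draws = (np.exp(samples) * p).sum(axis=1) * dx
print(Z_draws.mean(), Z_draws.std())
```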


Conclusions
- Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and lets us check our assumptions.
- BQ has nice convergence properties if its assumptions are correct.
- For inference, a GP prior on the likelihood is not especially appropriate, but other models are intractable.

Limitations and Future Directions
- Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and is only worth using when computing f(x) is expensive.
- How to extend to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
- It seems unlikely that we'll find another tractable nonparametric distribution like GPs; should we accept that we'll need a second round of approximate integration on a surrogate model?
- How much overhead is worthwhile? Work on bounded rationality seems relevant.

Thanks!