Inference and assessing convergence
Bayesian Statistics (Statistics 4224/5224), Spring 2021
March 18, 2021
The following notes summarize Sections 10.5 and 11.4–11.5 of Bayesian Data Analysis, by Andrew Gelman et al.
A nice introduction to MCMC convergence diagnostics is also given in Section 3.4 of Bayesian Statistical Methods, by Reich and Ghosh.
How many simulation draws are needed?
Bayesian inferences are usually most conveniently summarized by random draws from the posterior distribution of the model parameters.
Percentiles of the posterior distribution of univariate estimands can be reported to convey the shape of the distribution.
For example, reporting the 2.5%, 25%, 50%, 75%, and 97.5% points of sampled distributions of an estimand provides a 50% and a 95% posterior interval, and also conveys skewness in its marginal posterior density.
We also use posterior simulations to make inferences about predictive quantities. Given each draw $\theta^s$, we can sample any predictive quantity, $\tilde y^s \sim p(\tilde y \mid \theta^s)$. Posterior inferences and probability calculations can then be performed for each predictive quantity using the $S$ simulations.
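A minimal sketch of this idea in Python, using made-up posterior draws for a normal model (the model, numbers, and variable names are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the notes): S posterior draws of (mu, sigma)
# for a normal model y ~ Normal(mu, sigma).
S = 1000
mu_draws = rng.normal(5.0, 0.2, size=S)             # stand-in draws of mu
sigma_draws = np.abs(rng.normal(1.0, 0.1, size=S))  # stand-in draws of sigma

# For each posterior draw theta^s, simulate one predictive draw
# y_tilde^s ~ p(y_tilde | theta^s).
y_tilde = rng.normal(mu_draws, sigma_draws)

# Posterior predictive summaries follow directly from the S simulations.
print(np.percentile(y_tilde, [2.5, 25, 50, 75, 97.5]))
print(np.mean(y_tilde > 7.0))  # e.g., Pr(y_tilde > 7 | y)
```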
Our goal in a Bayesian computation is to obtain a set of independent draws $\theta^s$, $s = 1, \ldots, S$, from the posterior distribution, with enough draws $S$ so that quantities of interest can be estimated with reasonable accuracy.
For most examples, S = 100 independent draws are enough for reasonable posterior summaries.
We can see this by considering a scalar parameter $\theta$ with an approximately normal posterior distribution with mean $\mu_\theta$ and standard deviation $\sigma_\theta$.
We assume these cannot be calculated analytically and instead are estimated from the mean $\bar\theta$ and standard deviation $s_\theta$ of the $S$ simulation draws.
The posterior mean is then estimated to an accuracy of approximately $s_\theta/\sqrt{S}$.
The total standard deviation of the computational parameter estimate (including Monte Carlo error, the uncertainty contributed by having only a finite number of simulation draws) is then $s_\theta\sqrt{1 + 1/S}$.
For $S = 100$, the factor $\sqrt{1 + 1/S}$ is 1.005, implying that Monte Carlo error adds almost nothing to the uncertainty coming from actual posterior variance.
However, it can be convenient to have more than 100 simulations just so that the numerical summaries are more stable, even if this stability typically confers no important practical advantage.
In general, fewer simulations are needed to estimate posterior medians of parameters, probabilities near 0.5, and low-dimensional summaries than extreme quantiles, posterior means, probabilities of rare events, and higher-dimensional summaries.
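The following sketch illustrates the accuracy calculation above with hypothetical values of $\mu_\theta$ and $\sigma_\theta$ (the numbers are arbitrary, chosen only to show the $\sqrt{1+1/S}$ factor):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sketch: S independent draws from an approximately normal
# posterior with (hypothetical) mean mu_theta and sd sigma_theta.
mu_theta, sigma_theta = 2.0, 1.5
S = 100
theta = rng.normal(mu_theta, sigma_theta, size=S)

theta_bar = theta.mean()     # estimate of the posterior mean
s_theta = theta.std(ddof=1)  # estimate of the posterior sd

mc_se = s_theta / np.sqrt(S)             # Monte Carlo error of the posterior mean
total_sd = s_theta * np.sqrt(1 + 1 / S)  # posterior sd inflated by MC error

print(f"posterior mean estimate: {theta_bar:.3f} +/- {mc_se:.3f} (MC error)")
print(f"inflation factor sqrt(1 + 1/S) = {np.sqrt(1 + 1 / S):.4f}")
```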
Inference and assessing convergence
The basic method of inference from iterative simulation is the same as for Bayesian simulation in general: use the collection of all the simulated draws from $p(\theta\mid y)$ to summarize the posterior density and to compute quantiles, moments, and other summaries of interest as needed.
Posterior predictive simulations of unobserved outcomes $\tilde y$ can be obtained by simulation conditional on the drawn values of $\theta$.
Inference using the iterative simulation draws (MCMC) requires some care, however, as we discuss in this section.
Difficulties of inference from iterative simulation
Iterative simulation adds two challenges to simulation inference.
• First, if the iterations have not proceeded long enough, the simulations may be grossly unrepresentative of the target distribution.
• The second problem with iterative simulation draws is their within-sequence correlation; aside from any convergence issues, simulation inference from correlated draws is generally less precise than from the same number of independent draws.
We handle the special problems of iterative simulation in three ways:
• First, we attempt to design the simulation runs to allow effective monitoring of convergence, in particular by simulating multiple sequences with starting points dispersed throughout the parameter space.
• Second, we monitor convergence of all quantities of interest by comparing variation between and within simulated sequences until 'within' variation roughly equals 'between' variation. Only when the distribution of each simulated sequence is close to the distribution of all the sequences mixed together can they all be approximating the target distribution.
• Third, if the simulation efficiency is unacceptably low (in the sense of requiring too much real time on the computer to obtain approximate convergence of posterior inferences for quantities of interest), the algorithm can be altered.
Discarding early iterations of the simulation runs
To diminish the influence of the starting values, we generally discard the first half of each sequence and focus attention on the second half.
Our inferences will be based on the assumption that the distributions of the simulated values $\theta^t$, for large enough $t$, are close to the target distribution, $p(\theta\mid y)$.
We refer to the practice of discarding early iterations in Markov chain simulation as warm-up; depending on the context, different warm-up fractions can be appropriate.
We adopt the general practice of discarding the first half as a conservative choice.
Dependence of the iterations in each sequence
Another issue that sometimes arises, once approximate convergence has been reached, is whether to thin the sequences by keeping every $k$th simulation draw from each sequence and discarding the rest.
Whether or not the sequences are thinned, if the sequences have reached approximate convergence, they can be directly used for inferences about the parameters θ and any other quantities of interest.
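Thinning itself is a one-line operation; a small sketch (the array shape and the choice $k = 5$ are arbitrary assumptions):

```python
import numpy as np

# Sketch of thinning: keep every k-th draw from each sequence.  `chains` is a
# hypothetical (num_chains, num_iterations) array of post-warm-up draws, and
# k = 5 is an arbitrary choice.
chains = np.random.default_rng(3).normal(size=(5, 500))
k = 5
thinned = chains[:, ::k]
print(thinned.shape)  # (5, 100)
```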
Multiple sequences with overdispersed starting points
Our recommended approach to assessing convergence of iterative simulation is based on comparing different simulated sequences, as illustrated in the following figure, which shows five parallel simulations before and after approximate convergence.
[Figure: five parallel simulations, shown after 50 iterations, after 500 iterations, and as 5 chains combined; the horizontal axis runs from −4 to 4 in each panel.]
In the left-hand plot the multiple sequences have clearly not converged—the variance within each sequence is much less than the variance between sequences.
Later, in the middle plot, the sequences have mixed, and the two variance components are essentially equal.
Monitoring scalar estimands
We monitor each scalar estimand or other scalar quantities of interest separately. Estimands include all the parameters of interest in the model and any other quantities of interest (for example, the ratio of two parameters or the value of a predicted future observation).
Monitoring convergence: mixing and stationarity
Figure 11.3 on page 283 illustrates two of the challenges of monitoring convergence of iterative simulations.
The first graph shows two sequences, each of which looks fine on its own, but which together reveal a clear lack of convergence. To achieve convergence, the sequences must have mixed.
The second graph shows two chains that have mixed, in the sense that they have traced out a common distribution, but they do not appear to have converged. To achieve convergence, each individual sequence must reach stationarity.
Splitting each saved sequence into two parts
We diagnose convergence (separately, for each scalar quantity of interest) by checking mixing and stationarity.
There are various ways to do this; we apply a fairly simple approach in which we split each chain in half and check that all the resulting half-sequences have mixed.
We start with some number of simulated sequences in which the warm-up period (by default set to the first half of the simulations) has already been discarded.
We then take each of these chains and split it into a first and a second half (this is after discarding the warm-up iterations).
Let m be the number of chains (after splitting) and n be the length of each chain.
For example, suppose we simulate 5 chains, each of length 1000, and then discard the first half of each as warm-up.
We are then left with 5 chains, each of length 500, and we split each into two parts: iterations 1–250 (originally iterations 501–750) and iterations 251–500 (originally iterations 751–1000).
We now have m = 10 chains, each of length n = 250.
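A sketch of this warm-up and splitting bookkeeping in Python, using simulated stand-in draws in place of real MCMC output:

```python
import numpy as np

# Sketch (hypothetical draws): 5 chains, each of length 1000, for one scalar
# estimand psi, stored as a (chains, iterations) array.
rng = np.random.default_rng(2)
sims = rng.normal(size=(5, 1000))  # stand-in for MCMC output

# Discard the first half of each chain as warm-up.
saved = sims[:, 500:]              # 5 chains of length 500

# Split each saved chain into two halves: m = 10 chains of length n = 250.
half = saved.shape[1] // 2
split = np.concatenate([saved[:, :half], saved[:, half:]], axis=0)
m, n = split.shape
print(m, n)                        # 10 250
```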
Assess mixing using between- and within-sample variances
For each scalar estimand $\psi$, we label the simulations as $\psi_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, m$), and we compute $B$ and $W$, the between- and within-sequence variances:
$$B = \frac{n}{m-1}\sum_{j=1}^{m}\left(\bar\psi_{\cdot j} - \bar\psi_{\cdot\cdot}\right)^2, \quad \text{where } \bar\psi_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n}\psi_{ij}, \quad \bar\psi_{\cdot\cdot} = \frac{1}{m}\sum_{j=1}^{m}\bar\psi_{\cdot j},$$
and
$$W = \frac{1}{m}\sum_{j=1}^{m}s_j^2, \quad \text{where } s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\psi_{ij} - \bar\psi_{\cdot j}\right)^2.$$
The between-sequence variance, $B$, contains a factor of $n$ because it is based on the variance of the within-sequence means, $\bar\psi_{\cdot j}$, each of which is an average of $n$ values $\psi_{ij}$.
We can estimate $\mathrm{var}(\psi\mid y)$, the marginal posterior variance of the estimand, by a weighted average of $W$ and $B$, namely
$$\widehat{\mathrm{var}}^{+}(\psi\mid y) = \frac{n-1}{n}\,W + \frac{1}{n}\,B.$$
This quantity overestimates the marginal posterior variance assuming the starting distribution is appropriately overdispersed, but is unbiased under stationarity (that is, if the starting distribution equals the target distribution), or in the limit $n \to \infty$.
Meanwhile, for any finite $n$, the 'within' variance $W$ should be an underestimate of $\mathrm{var}(\psi\mid y)$ because the individual sequences have not had time to range over all of the target distribution and, as a result, will have less variability; in the limit as $n \to \infty$, the expectation of $W$ approaches $\mathrm{var}(\psi\mid y)$.
We monitor convergence of the iterative simulation by estimating the factor by which the scale of the current distribution for $\psi$ might be reduced if the simulations were continued in the limit $n \to \infty$. This potential scale reduction is estimated by
$$\hat R = \sqrt{\frac{\widehat{\mathrm{var}}^{+}(\psi\mid y)}{W}},$$
which declines to 1 as $n \to \infty$. If the potential scale reduction is high, then we have reason to believe that proceeding with further simulations may improve our inference about the target distribution of the associated scalar estimand.
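A sketch of the split-$\hat R$ computation, directly following the formulas for $B$, $W$, and $\widehat{\mathrm{var}}^{+}$ above (the function name and interface are assumptions for illustration):

```python
import numpy as np

def split_rhat(split_chains):
    """Potential scale reduction R-hat, following the B/W formulas above.

    split_chains: array of shape (m, n) holding already split, post-warm-up
    draws of one scalar estimand psi.  A sketch, not a reference implementation.
    """
    m, n = split_chains.shape
    chain_means = split_chains.mean(axis=1)                    # psi-bar_.j
    grand_mean = chain_means.mean()                            # psi-bar_..
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)  # between-sequence
    W = np.mean(split_chains.var(axis=1, ddof=1))              # within-sequence
    var_plus = (n - 1) / n * W + B / n                         # var^+(psi | y)
    return np.sqrt(var_plus / W)

# Example with the m = 10, n = 250 split chains from the earlier sketch:
# print(split_rhat(split))
```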
Effective number of simulation draws
Once the simulated sequences have mixed, we can compute an approximate ‘effective number of independent simulation draws’ for any estimand of interest ψ.
One way to define effective sample size for correlated simulation draws is to consider the statistical efficiency of the average of the simulations, $\bar\psi_{\cdot\cdot}$, as an estimate of the posterior mean, $E(\psi\mid y)$.
It is usual to compute effective sample size using the following asymptotic formula for the variance of the average of a correlated sequence:
$$\lim_{n\to\infty} mn\,\mathrm{var}(\bar\psi_{\cdot\cdot}) = \left(1 + 2\sum_{t=1}^{\infty}\rho_t\right)\mathrm{var}(\psi\mid y),$$
where $\rho_t$ is the autocorrelation of the sequence $\psi$ at lag $t$.
If the $n$ simulation draws from each of the $m$ chains were independent, then $\mathrm{var}(\bar\psi_{\cdot\cdot})$ would simply be $\frac{1}{mn}\mathrm{var}(\psi\mid y)$ and the sample size would be $mn$.
In the presence of correlation we then define the effective sample size as
$$n_{\mathrm{eff}} = \frac{mn}{1 + 2\sum_{t=1}^{\infty}\rho_t}.$$
To compute the effective sample size we need an estimate of the sum of the correlations $\rho_t$, for which we use information within and between sequences. We start by computing the total variance using the estimate $\widehat{\mathrm{var}}^{+}$; we then estimate the correlations by first computing the variogram $V_t$ at each lag $t$:
$$V_t = \frac{1}{m(n-t)}\sum_{j=1}^{m}\sum_{i=t+1}^{n}\left(\psi_{i,j} - \psi_{i-t,j}\right)^2.$$
We then estimate the correlations by inverting the formula $E(\psi_i - \psi_{i-t})^2 = 2(1 - \rho_t)\,\mathrm{var}(\psi)$:
$$\hat\rho_t = 1 - \frac{V_t}{2\,\widehat{\mathrm{var}}^{+}}.$$
Unfortunately we cannot simply sum all of these to estimate $n_{\mathrm{eff}}$; the difficulty is that for large values of $t$ the sample correlation is too noisy.
Instead we compute a partial sum, starting from lag 0 and continuing until the sum of autocorrelation estimates for two successive lags is negative. We use this positive partial sum as our estimate of $\sum_{t=1}^{\infty}\rho_t$.
Putting this all together yields the estimate
$$\hat n_{\mathrm{eff}} = \frac{mn}{1 + 2\sum_{t=1}^{T}\hat\rho_t},$$
where $T$ is the first positive integer for which $\hat\rho_{T+1} + \hat\rho_{T+2}$ is negative.
All these calculations should be performed using only the saved iterations, after discarding the warm-up period. For example, suppose we simulate 4 chains, each of length 1000, and discard the first half of each as warm-up. Then m = 8, n = 250, and we compute variograms and correlations only for the saved iterations (thus up to a maximum lag of 249, although in practice the stopping point T will be much lower).
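A sketch of the effective-sample-size calculation implied by the formulas above (illustrative only; packaged implementations refine this computation further):

```python
import numpy as np

def effective_sample_size(split_chains):
    """Sketch of n_eff using the variogram-based autocorrelations above.

    split_chains: (m, n) array of split, post-warm-up draws of one scalar
    estimand.  The function name and interface are illustrative assumptions.
    """
    m, n = split_chains.shape
    chain_means = split_chains.mean(axis=1)
    B = n / (m - 1) * np.sum((chain_means - chain_means.mean()) ** 2)
    W = np.mean(split_chains.var(axis=1, ddof=1))
    var_plus = (n - 1) / n * W + B / n

    # Autocorrelation estimates rho_hat_t = 1 - V_t / (2 var^+), where V_t is
    # the variogram at lag t, averaged over the m sequences.
    rho_hat = np.empty(n)
    rho_hat[0] = 1.0
    for t in range(1, n):
        V_t = np.mean((split_chains[:, t:] - split_chains[:, :-t]) ** 2)
        rho_hat[t] = 1.0 - V_t / (2.0 * var_plus)

    # Partial sum of rho_hat, stopping at the first T for which
    # rho_hat[T+1] + rho_hat[T+2] is negative.
    T = n - 3
    for t in range(1, n - 2):
        if rho_hat[t + 1] + rho_hat[t + 2] < 0:
            T = t
            break
    return m * n / (1.0 + 2.0 * np.sum(rho_hat[1:T + 1]))

# Example with the split chains from the earlier sketch:
# print(effective_sample_size(split))
```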
Stopping the simulations
We monitor convergence for the entire multivariate distribution, $p(\theta\mid y)$, by computing the potential scale reduction factor $\hat R$ and the effective sample size $\hat n_{\mathrm{eff}}$ for each scalar summary of interest.
We recommend computing the potential scale reduction for all scalar estimands of interest; if $\hat R$ is not near 1 for all of them, continue the simulation runs.
We can use the effective sample size $\hat n_{\mathrm{eff}}$ to give us a sense of the precision obtained from our simulations. As discussed above, for many purposes it should suffice to have 100 or even 10 independent simulation draws (if $n_{\mathrm{eff}} = 10$, the simulation standard error is increased by a factor of $\sqrt{1 + 1/10} \approx 1.05$).
As a default rule, we suggest running the simulation until $\hat n_{\mathrm{eff}}$ is at least $5m$, that is, until there are the equivalent of at least 10 independent draws per sequence (recall that $m$ is twice the number of sequences, as we have split each sequence into two parts so that $\hat R$ can assess stationarity as well as mixing).
Once $\hat R$ is near 1 and $\hat n_{\mathrm{eff}}$ is more than 10 per chain for all scalar estimands of interest, just collect the $mn$ simulations (with warm-up iterations already excluded) and treat them as a sample from the target distribution.
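Putting the two diagnostics together, a rough sketch of this stopping check, reusing the hypothetical split_rhat and effective_sample_size functions sketched above and the conventional 1.1 threshold mentioned below:

```python
# Rough sketch of the default stopping rule.  `split_draws` maps each scalar
# estimand's name to its (m, n) array of split, post-warm-up draws; the
# 1.1 threshold is the conventional choice mentioned in these notes.
def converged(split_draws, rhat_tol=1.1):
    for chains in split_draws.values():
        m = chains.shape[0]  # number of split chains for this estimand
        if split_rhat(chains) > rhat_tol:
            return False     # R-hat not yet near 1: keep simulating
        if effective_sample_size(chains) < 5 * m:
            return False     # fewer than 5m effective draws: keep simulating
    return True
```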
Even if an iterative simulation appears to converge and has passed all tests of convergence, it still may actually be far from convergence if important areas of the target distribution were not captured by the starting distribution and are not easily reachable by the simulation algorithm.
When we declare approximate convergence, we are actually concluding that each individual sequence appears stationary and that the observed sequences have mixed well with each other. These checks are not hypothesis tests. There is no p-value and no statistical significance. We assess discrepancy from convergence via practical significance (or some conventional version thereof, such as $\hat R > 1.1$).