Model Checking
Bayesian Statistics, Statistics 4224/5224, Spring 2021
February 18, 2021
Homework 3 hints and suggestions
1. Problem 1 concerns a beta-binomial hierarchical model. You can easily adapt the R code given in 'Example05a' to solve much of this problem. As in that example, I suggest you reparameterize from (α, β) to (log(α/β), log(α + β)) for posterior sampling based on a grid approximation. You should find that values outside the ranges −1.5 < log(α/β) < 1.5 and 0 < log(α + β) < 6 can safely be ignored (they have posterior probability of essentially zero).
2. Problem 2 concerns a hierarchical normal model with known variance, like the one estimated in 'Example05b,' which contains most of the R code you will need to solve this problem.
3. Problem 3 concerns an ordinary (nonhierarchical) Poisson-gamma model. Part (a) can be answered 'exactly' using the qgamma function; parts (b) and (c) are best accomplished by Monte Carlo sampling. (A generic sketch of both approaches appears after this list.)
4. Problem 4 involves posterior predictive checks.
5. Problem 5 concerns a Bayes factor. An analytic solution is available, but I suggest you evaluate the integrals numerically, perhaps using the Monte Carlo method. The first step is to figure out exactly what those integrals are, that is, the marginal likelihoods on which

BF(H2; H1) = p(y1, y2 | H2) / p(y1, y2 | H1)

depends. (A generic Monte Carlo sketch appears after this list.)
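For Problem 3, both computational approaches can be sketched for a generic Poisson-gamma model; the prior parameters and data below are placeholders, not the values in the assignment.

  a <- 1; b <- 1                      # assumed Gamma(a, b) prior on theta
  y <- c(3, 1, 4, 2, 6)               # placeholder count data
  n <- length(y)

  # 'Exact' posterior quantiles via qgamma, e.g. a 95% interval for theta:
  qgamma(c(0.025, 0.975), a + sum(y), b + n)

  # Monte Carlo: sample theta from its Gamma posterior, then sample a
  # predictive draw of a new observation for each sampled theta:
  S <- 10000
  theta <- rgamma(S, a + sum(y), b + n)
  y_tilde <- rpois(S, theta)
  quantile(y_tilde, c(0.025, 0.975))  # e.g. a posterior predictive interval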
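For the Bayes factor, each marginal likelihood is an integral of the likelihood against the prior, p(y1, y2|H) = ∫ p(y1, y2|θ, H) p(θ|H) dθ, which can be estimated by averaging the likelihood over draws from the prior. The sketch below uses a made-up Poisson likelihood and Gamma priors purely for illustration; substitute the actual likelihood and priors from Problem 5.

  set.seed(1)
  y1 <- 3; y2 <- 7                    # placeholder data
  S <- 100000

  # H1: a single rate theta shared by y1 and y2 (illustrative Gamma(2, 1) prior)
  theta <- rgamma(S, 2, 1)
  marg_H1 <- mean(dpois(y1, theta) * dpois(y2, theta))

  # H2: independent rates for y1 and y2 (same illustrative prior)
  theta_a <- rgamma(S, 2, 1)
  theta_b <- rgamma(S, 2, 1)
  marg_H2 <- mean(dpois(y1, theta_a) * dpois(y2, theta_b))

  BF_21 <- marg_H2 / marg_H1          # estimate of BF(H2; H1)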
The remainder of these notes summarizes Chapter 6 of Bayesian Data Analysis, by Andrew Gelman et al.
A less technical discussion of model checking by posterior predictive checks is given in Section 5.6 of Bayesian Statistical Methods, by Brian J. Reich and Sujit K. Ghosh.
The place of model checking in applied Bayesian statistics
Once we have accomplished the first two steps of a Bayesian analysis—constructing a probability model and computing the posterior distribution of all estimands—we should not ignore the relatively easy step of assessing the fit of the model to the data and to our substantive knowledge.
Checking the model is crucial to statistical analysis. Bayesian prior-to-posterior inference assumes the whole structure of a probability model and can yield misleading inferences when the model is poor.
A good Bayesian analysis, therefore, should include at least some check of the adequacy of the fit of the model to the data, and the plausibility of the model for the purposes for which the model will be used.
Judging model flaws by their practical implications
We do not like to ask, 'Is our model true or false?', since probability models in most data analyses will not be perfectly true.
The more relevant question is, ‘Do the model’s deficiencies have a noticeable effect on the substantive inferences?’
Do the inferences from the model make sense?
External validation
We can check a model by external validation: using the model to make predictions about future data, then collecting those data and comparing them to their predictions.
Posterior means should be correct on average, 50% intervals should contain the true values about half the time, and so forth.
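As a small illustration of this kind of calibration check (with simulated placeholder numbers standing in for real predictions and later-observed data), one can compute the empirical coverage of the 50% posterior predictive intervals:

  set.seed(1)
  pred <- matrix(rnorm(1000 * 20), nrow = 1000)  # placeholder predictive draws
  y_new <- rnorm(20)                             # placeholder future observations

  lo <- apply(pred, 2, quantile, probs = 0.25)   # 50% interval endpoints
  hi <- apply(pred, 2, quantile, probs = 0.75)
  mean(y_new >= lo & y_new <= hi)                # should be near 0.5 if calibrated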
Choices in defining the predictive quantities
A single model can be used to make different predictions.
Below we discuss posterior predictive checking, which uses global summaries to check the joint posterior predictive distribution p(ỹ|y).
Posterior predictive checking
If the model fits, then replicated data generated under the model should look similar to the observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution. This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance.
Our basic technique for checking the fit of a model to data is to draw simulated values from the joint posterior predictive dis- tribution of replicated data, and compare those samples to the observed data. Any systematic differences between the simula- tions and the data indicate potential failings of the model.
It is often useful to measure the ‘statistical significance’ of the lack of fit, a notion we formalize below, as a ‘Bayesian p-value.’
Notation for replications
Let y be the observed data and θ be the vector of parameters.
Define yrep as the replicated data that could have been observed, or, to think predictively, as the data we would see tomorrow if the experiment that produced y today were replicated with the same model and the same value of θ that produced the observed data.
We will work with the distribution of yrep given the current state of knowledge, that is, with the posterior predictive distribution
p(yrep|y) = ∫ p(yrep|θ) p(θ|y) dθ .
Test statistics
We measure the discrepancy between model and data by defining test statistics, the aspects of the data we wish to check.
A test statistic, T(y), is a scalar summary of the data that is used as a standard when comparing data to predictive simulations.
Tail-area probabilities
Lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability, or p- value, of the test statistic, computed using posterior simulations of yrep.
The Bayesian p-value
The Bayesian p-value is defined as the probability that the repli- cated data could be more extreme than the observed data, as measured by the test statistic:
pB = Pr(T(yrep) ≥ T(y) | y)

   = ∫ I{T(yrep) ≥ T(y)} p(yrep|y) dyrep

   = ∫∫ I{T(yrep) ≥ T(y)} p(yrep|θ) p(θ|y) dθ dyrep ,

where I is the indicator function.
In practice, we usually compute the posterior predictive distribu- tion using simulation.
If we already have S simulations from the posterior density of θ,

θ1, . . . , θS ∼ iid p(θ|y) ,

we just draw one yrep from the predictive distribution for each simulated θ,

yrep,s ∼ p(yrep|θs) for s = 1, . . . , S ;

we now have S draws from the posterior predictive distribution p(yrep|y).
The estimated p-value is just the proportion of those S simulations for which the test statistic equals or exceeds its realized value, that is,

pB = (1/S) Σ_{s=1}^{S} I{T(yrep,s) ≥ T(y)} .
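This simulation recipe can be sketched in a few lines of R; the data and Poisson sampling model here are made up for illustration, and in practice the posterior draws θ1, . . . , θS would come from whatever model is being checked.

  set.seed(1)
  y <- c(2, 5, 1, 0, 7, 3, 4, 2)                      # placeholder observed data
  S <- 5000
  theta_draws <- rgamma(S, 1 + sum(y), 1 + length(y)) # placeholder posterior draws

  T_obs <- max(y)                  # test statistic T(y): the sample maximum
  T_rep <- numeric(S)
  for (s in 1:S) {
    y_rep <- rpois(length(y), theta_draws[s])  # one replicated data set per draw
    T_rep[s] <- max(y_rep)
  }
  p_B <- mean(T_rep >= T_obs)      # proportion of replications at least as extreme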
Choosing test statistics
If T(y) does not appear to be consistent with the set of values T(yrep,1), . . . , T(yrep,S), then the model is making predictions that do not fit the data.
Because a probability model can fail to reflect the process that generated the data in any number of ways, posterior predictive p-values can be computed for a variety of test statistics in order to evaluate more than one possible model failure.
Ideally, the test statistics will be chosen to reflect aspects of the model that are relevant to the scientific purposes to which the inference will be applied.
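Continuing the illustrative Poisson example, several test statistics can be checked with the same machinery; the statistics below (mean, standard deviation, maximum, proportion of zeros) are examples of what one might choose, not a prescription.

  set.seed(1)
  y <- c(2, 5, 1, 0, 7, 3, 4, 2)                      # placeholder observed data
  S <- 5000
  theta_draws <- rgamma(S, 1 + sum(y), 1 + length(y)) # placeholder posterior draws

  stats <- list(mean = mean, sd = sd, max = max,
                prop_zero = function(x) mean(x == 0))
  p_vals <- sapply(stats, function(Tfun) {
    T_rep <- sapply(theta_draws, function(th) Tfun(rpois(length(y), th)))
    mean(T_rep >= Tfun(y))
  })
  round(p_vals, 3)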
Multiple comparisons
One might worry about interpreting the significance level of multiple tests, or of tests chosen by inspection of the data.
We do not make any ‘multiple comparisons’ adjustment, because we use predictive checks to see how particular aspects of the data would be expected to appear in replications.
We are not concerned with ‘Type I error’ rate because we use the checks not to accept or reject a model, but rather to understand the limits of its applicability in realistic replications.
Interpreting posterior predictive p-values
A model is suspect if a discrepancy is of practical importance and its observed value has a tail-area probability near 0 or 1, indicating that the observed pattern would be unlikely to be seen in replications of the data if the model were true.
The relevant goal is not to answer the question ‘Do the data come from the assumed model?’ (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model’s own assumptions.
Example
Courseworks → Files → Examples → Ex06 ModelChecking.