Bayesian model selection
Consider the regression problem, where we want to predict the values of an unknown function $y\colon \mathbb{R}^d \to \mathbb{R}$ given examples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ to serve as training data. In Bayesian linear regression, we made the following assumption about $y(x)$:
$$y(x) = \phi(x)^\top w + \varepsilon(x), \tag{1}$$
where $\phi(x)$ is a now explicitly written feature expansion of $x$. We proceed in the normal Bayesian way: we place Gaussian priors on our unknowns, the parameters $w$ and the residuals $\varepsilon$, then derive the posterior distribution over $w$ given $\mathcal{D}$, which we use to make predictions.
One question left unanswered is how to choose a good feature expansion function $\phi(x)$. For example, a purely linear model could use $\phi(x) = [1, x]^\top$, whereas a quadratic model could use $\phi(x) = [1, x, x^2]^\top$, etc. In general, arbitrary feature expansions are allowed. How can I select between them? Even more generally, how do I select whether I should use linear regression or a completely different probabilistic model to explain my data? These are questions of model selection, and naturally there is a Bayesian approach to it.
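As a small illustration of the two expansions mentioned above, here is a minimal sketch for scalar inputs. The helper name `polynomial_features` is hypothetical and not part of the notes.

```python
def polynomial_features(x, degree):
    """Return the feature expansion [1, x, x**2, ..., x**degree] of a scalar x."""
    return [x ** k for k in range(degree + 1)]


# Linear and quadratic expansions of the same input.
print(polynomial_features(2.0, 1))  # [1.0, 2.0]
print(polynomial_features(2.0, 2))  # [1.0, 2.0, 4.0]
```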
Before we continue our discussion of model selection, we will first define the word model, which is often used loosely without explicit definition. A model is a parametric family of probability distributions, each of which can explain the observed data. Another way to explain the concept of a model is that if we have chosen a likelihood $p(\mathcal{D} \mid \theta)$ for our data, which depends on a parameter $\theta$, then the model is the set of all such likelihoods (each of which is a distribution over $\mathcal{D}$), one for every possible value of the parameter $\theta$.
In the case of linear regression, the weight vector w defines the parametric family, and the model is the set of distributions
$$p(y \mid X, w, \sigma^2) = \mathcal{N}(y; Xw, \sigma^2 I),$$
indexed by all possible $w$. Each one of these is a potential explanation of the observed values $y$ given $X$. In the case of flipping a coin $n$ times with an unknown bias $\theta$ and observing the number of heads $x$, the model is
$$p(x \mid n, \theta) = \mathrm{Binomial}(x, n, \theta),$$
where there is one binomial distribution for every possible $\theta \in (0, 1)$. In the Bayesian method, we maintain a belief over which elements in the model we consider plausible by reasoning about $p(\theta \mid \mathcal{D})$ via Bayes' theorem.
Suppose now that I have at my disposal a finite set of models $\{M_i\}_i$ that I may use to explain my observed data $\mathcal{D}$, and let us write $\theta_i$ for the parameters of model $M_i$. How do we know which model to prefer? We work out the posterior probability over the models via Bayes' theorem! We have:
$$\Pr(M_i \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid M_i)\,\Pr(M_i)}{\sum_j p(\mathcal{D} \mid M_j)\,\Pr(M_j)}.$$
Here $\Pr(M_i)$ is a prior distribution over models that we have selected; a common practice is to set this to a uniform distribution over the models. The value $p(\mathcal{D} \mid M_i)$ may also be written in a more familiar form:
$$p(\mathcal{D} \mid M_i) = \int p(\mathcal{D} \mid \theta_i, M_i)\,p(\theta_i \mid M_i)\,d\theta_i.$$
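This is exactly the denominator when applying Bayes' theorem to find the posterior $p(\theta_i \mid \mathcal{D}, M_i)$!
$$p(\theta_i \mid \mathcal{D}, M_i) = \frac{p(\mathcal{D} \mid \theta_i, M_i)\,p(\theta_i \mid M_i)}{\int p(\mathcal{D} \mid \theta_i, M_i)\,p(\theta_i \mid M_i)\,d\theta_i} = \frac{p(\mathcal{D} \mid \theta_i, M_i)\,p(\theta_i \mid M_i)}{p(\mathcal{D} \mid M_i)},$$
where we have simply conditioned on $M_i$ to be explicit. In the context of model selection, the term $p(\mathcal{D} \mid M_i)$ is known as the model evidence or simply evidence. One interpretation of the model evidence is the probability that your model could have generated the observed data, under the chosen prior belief over its parameters $\theta_i$.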
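Once the evidence of each model is available, the posterior over models is just a normalization. The following is a minimal sketch, assuming the evidences are supplied as log values (a choice made here for numerical stability, not something the notes prescribe) and that every prior probability is strictly positive. The function name `model_posterior` is hypothetical.

```python
import math


def model_posterior(log_evidences, priors=None):
    """Return Pr(M_i | D) given log p(D | M_i) and prior probabilities Pr(M_i)."""
    n = len(log_evidences)
    if priors is None:
        priors = [1.0 / n] * n  # uniform prior over the models
    # Unnormalized log posterior: log p(D | M_i) + log Pr(M_i).
    log_joint = [le + math.log(p) for le, p in zip(log_evidences, priors)]
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_joint)
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Two models with evidences 0.2 and 0.6 under a uniform model prior.
print(model_posterior([math.log(0.2), math.log(0.6)]))
```

With a uniform model prior, the posterior probabilities are simply the evidences rescaled to sum to one.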
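Suppose now that we have exactly two models for the observed data that we wish to compare: $M_1$ and $M_2$, with corresponding parameter vectors $\theta_1$ and $\theta_2$ and prior probabilities $\Pr(M_1)$ and $\Pr(M_2)$. In this case it is easiest to compute the posterior odds, the ratio of the models' probabilities given the data:
$$\frac{\Pr(M_1 \mid \mathcal{D})}{\Pr(M_2 \mid \mathcal{D})} = \frac{\Pr(M_1)\,p(\mathcal{D} \mid M_1)}{\Pr(M_2)\,p(\mathcal{D} \mid M_2)} = \frac{\Pr(M_1) \int p(\mathcal{D} \mid \theta_1, M_1)\,p(\theta_1 \mid M_1)\,d\theta_1}{\Pr(M_2) \int p(\mathcal{D} \mid \theta_2, M_2)\,p(\theta_2 \mid M_2)\,d\theta_2},$$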
which is simply the prior odds multiplied by the ratio of the evidence for each model. The latter quantity is also called the Bayes factor in favor of M1. Publishing Bayes factors allows another practitioner to easily substitute their own model priors and derive their own conclusions about the models being considered.
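To illustrate how a published Bayes factor can be combined with a reader's own model prior, here is a minimal sketch for the two-model case. The function names and the numbers in the usage lines are hypothetical, chosen only for illustration.

```python
def posterior_odds(bayes_factor, prior_m1):
    """Posterior odds of M1 over M2: prior odds times the Bayes factor."""
    prior_odds = prior_m1 / (1.0 - prior_m1)
    return prior_odds * bayes_factor


def posterior_prob_m1(bayes_factor, prior_m1):
    """Convert the posterior odds into Pr(M1 | D) when only M1 and M2 are considered."""
    odds = posterior_odds(bayes_factor, prior_m1)
    return odds / (1.0 + odds)


# The same Bayes factor of 3 read by two practitioners with different priors.
print(posterior_prob_m1(3.0, prior_m1=0.5))  # 0.75
print(posterior_prob_m1(3.0, prior_m1=0.2))
```

The Bayes factor itself does not depend on the model prior, which is why it is a convenient quantity to report.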
Example
Wikipedia gives a truly excellent example of Bayesian model selection in practice.[1] Suppose I am presented with a coin and want to compare two models for explaining its behavior. The first model, $M_1$, assumes that the heads probability is fixed to $1/2$. Notice that this model does not have any parameters. The second model, $M_2$, assumes that the heads probability is fixed to an unknown value $\theta \in (0, 1)$, with a uniform prior on $\theta$: $p(\theta \mid M_2) = 1$ (this is equivalent to a beta prior on $\theta$ with $\alpha = \beta = 1$). For simplicity, we choose a uniform model prior: $\Pr(M_1) = \Pr(M_2) = 1/2$.
Suppose we flip the coin $n = 200$ times and observe $x = 115$ heads. Which model should we prefer in light of this data? We compute the model evidence for each model. The model evidence for $M_1$ is quite straightforward, as it has no parameters:
$$\Pr(x \mid n, M_1) = \mathrm{Binomial}(x, n, 1/2) = \binom{200}{115} \frac{1}{2^{200}} \approx 0.005956.$$
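The model evidence for $M_2$ requires integrating over the parameter $\theta$:
$$\Pr(x \mid n, M_2) = \int \Pr(x \mid n, \theta, M_2)\,p(\theta \mid M_2)\,d\theta = \int_0^1 \binom{200}{115} \theta^{115} (1 - \theta)^{200 - 115}\,d\theta = \frac{1}{201} \approx 0.004975.$$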
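The value $1/201$ can be checked with the beta function. The following worked step is a sketch for general $n$ and $x$, using the standard identity $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$; it shows that under the uniform prior the evidence for $M_2$ is $1/(n + 1)$ regardless of the observed count $x$.

```latex
\int_0^1 \binom{n}{x} \theta^{x} (1 - \theta)^{n - x} \, d\theta
  = \binom{n}{x} B(x + 1,\, n - x + 1)
  = \frac{n!}{x!\,(n - x)!} \cdot \frac{x!\,(n - x)!}{(n + 1)!}
  = \frac{1}{n + 1}.
```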
The Bayes factor in favor of M1 is approximately 1.2, so the data give very weak evidence in favor of the simpler model M1.
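The numbers in this example can be reproduced with the Python standard library. This is a minimal sketch that uses exact rational arithmetic and takes the closed form $1/(n + 1)$ for the evidence of $M_2$ as given; the variable names are my own.

```python
from fractions import Fraction
from math import comb

n, x = 200, 115

# Evidence for M1: binomial probability of x heads with theta fixed to 1/2.
evidence_m1 = Fraction(comb(n, x), 2 ** n)

# Evidence for M2: integrating the binomial likelihood against a uniform
# prior on theta gives 1 / (n + 1).
evidence_m2 = Fraction(1, n + 1)

# Bayes factor in favor of M1.
bayes_factor = evidence_m1 / evidence_m2

print(float(evidence_m1))   # approximately 0.005956
print(float(evidence_m2))   # approximately 0.004975
print(float(bayes_factor))  # approximately 1.2
```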
[1] http://en.wikipedia.org/wiki/Bayes_factor#Example
An interesting aside here is that a frequentist hypothesis test would reject the null hypothesis $\theta = 1/2$ at the $\alpha = 0.05$ level. The probability of generating at least 115 heads under model $M_1$ is approximately 0.02 (similarly, the probability of generating at least 115 tails is also 0.02), so a two-sided test would give a p-value of approximately 4%.
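The tail probability quoted above can be checked by summing the binomial probabilities directly. This is a minimal sketch using only the standard library; doubling the upper tail relies on the symmetry of the binomial distribution at $\theta = 1/2$.

```python
from math import comb

n, x = 200, 115

# Probability of at least x heads in n flips of a fair coin.
upper_tail = sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

# By symmetry, at least x tails is equally likely, so the two-sided
# p-value is twice the upper tail.
p_value = 2 * upper_tail

print(upper_tail)  # roughly 0.02
print(p_value)     # roughly 0.04
```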
Occam's razor
One spin on Bayesian model selection is that it automatically gives a preference towards simpler models, in line with Occam's razor. One way to see this is to consider the model evidence $p(\mathcal{D} \mid M)$ as a probability distribution over datasets $\mathcal{D}$. More complex models can explain more datasets, so the support of this distribution is wider in the sample space. But note that the distribution must normalize over the sample space as well, so we pay a price for generality. When moving from a simpler model to a more complex model, the probability of some datasets that are well explained by the simpler model must inevitably decrease to give up probability mass for the newly explained datasets in the widened support of the more complex model. The model selection process then drives us to select the model that is just complex enough to explain the data at hand, a built-in Occam's razor.
In the coin flipping example above, model $M_1$ can only explain datasets with empirical heads probability reasonably near $1/2$. An observation of 200 heads, for example, would have astronomically small probability under this model. The second model $M_2$ can explain any set of observations by selecting an appropriate $\theta$. The price for this generality, though, is that datasets with a roughly equal number of heads and tails have a smaller probability under $M_2$ than under $M_1$.
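The trade-off can be made concrete by tabulating both evidences over every possible outcome of 200 flips. This is a minimal sketch for the coin example; it uses the fact that a uniform prior on $\theta$ gives evidence $1/(n + 1)$ for every head count, and the variable names are my own.

```python
from math import comb

n = 200

# Evidence of each model as a distribution over the possible head counts 0..n.
evidence_m1 = [comb(n, x) / 2 ** n for x in range(n + 1)]  # fair coin
evidence_m2 = [1.0 / (n + 1)] * (n + 1)                    # uniform prior on theta

# Both distributions must normalize over the same sample space.
print(sum(evidence_m1), sum(evidence_m2))

# Head counts for which the simpler model M1 has the larger evidence.
preferred_m1 = [x for x in range(n + 1) if evidence_m1[x] > evidence_m2[x]]
print(min(preferred_m1), max(preferred_m1))
```

Outside the printed range the flexible model $M_2$ has the larger evidence; inside it, the mass $M_2$ spends on extreme outcomes leaves it with less than $M_1$.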
Model selection for Bayesian linear regression
A common application for model selection is for selecting between feature expansion functions $\phi(x)$ in Bayesian linear regression. Here the model $M_i$ could for example correspond to order-$i$ polynomial regression with
$$\phi_i(x) = [1, x, x^2, \dots, x^i]^\top.$$
After selecting a set of these models to compare, as well as a prior probability for each, the only remaining task is to compute the evidence for each model given the observed data $(X, y)$. In our discussion of Bayesian linear regression, we have actually already computed the desired quantity:
$$p(y \mid X, \sigma^2, M_i) = \mathcal{N}\bigl(y;\ \phi_i(X)\mu,\ \phi_i(X)\Sigma\phi_i(X)^\top + \sigma^2 I\bigr),$$
where I have explicitly written the basis expansion $\phi_i$, and $\mu$ and $\Sigma$ denote the prior mean and covariance of the weights $w$.
Note that the model $M_i$ can also easily explain all datasets well explained by the models $M_j$ for $j < i$, by simply setting the weights on higher-order terms to zero. Again, however, the simpler model will be preferred due to the Occam's razor effect described above.
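The evidence of a polynomial model can be evaluated as a multivariate Gaussian log density. The following is a minimal sketch using NumPy, under assumptions that are mine rather than the notes': a zero prior mean, an isotropic prior covariance `prior_var * I`, a known noise variance, and synthetic data generated from a quadratic. The function names are hypothetical.

```python
import numpy as np


def poly_design(x, degree):
    """Design matrix whose rows are [1, x_n, x_n**2, ..., x_n**degree]."""
    return np.vander(x, degree + 1, increasing=True)


def log_evidence(x, y, degree, noise_var=0.1, prior_var=1.0):
    """log N(y; 0, prior_var * Phi Phi^T + noise_var * I) for an order-`degree` model.

    Assumes a zero-mean prior on the weights with covariance prior_var * I.
    """
    Phi = poly_design(x, degree)
    n = len(y)
    cov = prior_var * Phi @ Phi.T + noise_var * np.eye(n)
    L = np.linalg.cholesky(cov)   # cov = L L^T
    alpha = np.linalg.solve(L, y)  # so that alpha @ alpha = y^T cov^{-1} y
    log_det_half = np.log(np.diag(L)).sum()  # equals 0.5 * log det(cov)
    return -0.5 * alpha @ alpha - log_det_half - 0.5 * n * np.log(2 * np.pi)


# Synthetic data from a quadratic function (an illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=np.sqrt(0.1), size=30)

for degree in range(6):
    print(degree, log_evidence(x, y, degree))
```

Comparing the printed log evidences across degrees is the model selection step; combined with a prior over the degrees, they yield the posterior over models.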
Bayesian Model Averaging
Note that a full Bayesian treatment of a problem would eschew model selection entirely. Instead, when making predictions, we should theoretically use the sum rule to marginalize the unknown model, e.g.:
$$p(y \mid x, \mathcal{D}) = \sum_i p(y \mid x, \mathcal{D}, M_i)\,\Pr(M_i \mid \mathcal{D}).$$
Such an approach is called Bayesian model averaging. Although this is sometimes seen, model selection is still used widely in practice. The reasons are that the computational overhead of using a single model is much lower than having to continually retrain multiple models, and that Bayesian model averaging uses a mixture distribution for predictions, which can have annoying analytic properties (for example, the predictive distribution could be multimodal).
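To make the mixture concrete, here is a minimal sketch that summarizes a model-averaged prediction by its mean and variance. It assumes each model's predictive distribution at the test input is a Gaussian with known mean and variance; the function name `bma_predict` and the example numbers are hypothetical.

```python
def bma_predict(model_probs, means, variances):
    """Mean and variance of the mixture sum_i Pr(M_i | D) N(y; mean_i, var_i)."""
    mean = sum(p * m for p, m in zip(model_probs, means))
    # E[y^2] under the mixture, using E[y^2 | M_i] = var_i + mean_i**2.
    second_moment = sum(
        p * (v + m ** 2) for p, m, v in zip(model_probs, means, variances)
    )
    return mean, second_moment - mean ** 2


# Two equally probable models that disagree about the prediction.
mean, variance = bma_predict([0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
print(mean, variance)  # 0.0 2.0
```

The mixture mean and variance are easy to compute, but they can hide the multimodality mentioned above: when the component means are far apart relative to the component spreads, the mixture mean can fall where neither model places much mass.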