STAT314/461 Bayesian Inference
patrick. .nz
room tba
September, 2021
Things to discuss
lab this week
office hours
Week 6
What we will cover this term
not necessarily in exactly this order
Bayesian computation – from direct simulation to Markov Chain
Monte Carlo methods.
rejection sampling
importance sampling
Metropolis-Hastings algorithm
Hierarchical Models
Problems of the “missing data type”
Computation will be in R
Assignments
20/09 – due 1/10 max 10% for STAT314
4/10 – due 15/10, max 10% for STAT314
exact dates may change depending on how fast we go.
Best 4 of 5 assignments will count for your grade.
Things we need to know about aside from Bayesian Inference
logistic regression
multivariate normal distribution (see dmvnorm() in the mvtnorm
package)
R programming:
1 For loops
2 While loops
3 functions – using and writing?
setting priors
Recap on Bayesian Inference
Recap on Bayesian Inference: Some big Bayesian ideas
What is Bayesian modelling?
An approach to statistical modelling which makes explicit use of
probability models for
observables (data)
parameters of models
Uses the data actually observed to update the prior distribution
(pre-data) for the parameters to a posterior distribution (post-data);
p(θ|data)
“Statistics is the study of uncertainty; Uncertainty should be
measured by probability.” (Lindley, 2000)
Review some basics
The aim is always p(unknown|data)
aka the posterior distribution – think “post-data” distribution
computation of the posterior requires:
a model for the data
a model for the parameters of the data model – the “prior” or “prior
model”
(model ≡ statistical model (recognises uncertainty))
the prior (think “pre-data”, or “pre current data”) allows information about the problem that is not in the data to be brought to the analysis
Some features of Bayesian modelling and inference
A full distribution is available for inference; not just point estimates,
standard errors and intervals.
So it's easy to make inferences like Pr(θ > c |data) = 0.1.
Given a posterior distribution for model parameters, inference for observables not yet seen follows straightforwardly from the rules of probability theory.
The idea of using a distribution for inference
Suppose we are interested in estimating a binomial proportion, e.g. the proportion unemployed:
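For example, here is a minimal R sketch (the counts below are hypothetical, purely for illustration) showing the whole Beta posterior for a binomial proportion under a Beta(1, 1) prior, and a probability statement read directly off it:
# hypothetical data: y "unemployed" out of n sampled
y <- 37; n <- 250
theta <- seq(0, 0.4, length.out = 500)
# Beta(1, 1) prior  =>  Beta(y + 1, n - y + 1) posterior
plot(theta, dbeta(theta, y + 1, n - y + 1), type = "l",
     xlab = "theta", ylab = "posterior density")
pbeta(0.2, y + 1, n - y + 1, lower.tail = FALSE)   # Pr(theta > 0.2 | data)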
Bayesian Inference: More formally
Suppose Y denotes some variable of interest, e.g. income
Yi – value of Y for the i-th individual
Y – collection of Y values, Y = (Y1,Y2, . . .)
We model Y as a random variable with distribution
p(Y|θ) = fY(Y|θ) (1)
Usually assume conditional independence so
p(Y|θ) = ∏_i p(Yi |θ) = ∏_i fY (Yi |θ) (2)
We observe Yobs = (Y1,Y2, . . . ,Yn)
The first job for a Bayesian is to compute
p(θ|Yobs) = p(Yobs |θ)p(θ) / ∫ p(Yobs |θ)p(θ) dθ ∝ p(Yobs |θ)p(θ) (3)
Bayesian Inference: Likelihood and Prior
p(θ|Yobs) ∝ p(Yobs |θ)p(θ) = Likelihood × Prior (4)
Likelihood – probability density for observing the data actually seen,
under the assumed model; a function of θ. Compute it using the model
p(Y|θ) = fY(Y|θ).
Under conditional independence p(Y|θ) = ∏_{i=1}^{n} fY (Yi |θ).
e.g. suppose Yobs = (400, 1012, 961); assuming conditional
independence (given θ), the likelihood function is
p(Yobs |θ) = fY (400|θ)× fY (1012|θ)× fY (961|θ) (5)
Prior – probability model for θ which represents information about θ that
is not contained in the observed data.
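As a quick R sketch of evaluating the likelihood in (5) (the slide does not specify fY, so an exponential data model with mean θ is assumed here purely for illustration):
Yobs <- c(400, 1012, 961)
# log-likelihood under the assumed model Y_i ~ Exponential(rate = 1/theta)
loglik <- function(theta) sum(dexp(Yobs, rate = 1 / theta, log = TRUE))
loglik(800)   # evaluate at a single value of theta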
Bayesian Inference for observables not yet seen
e.g. for
future observations
missing observations
Just use probability theory to compute
p(Ynew |Yobs) = ∫ p(Ynew |Yobs , θ) p(θ|Yobs) dθ (6)
Can think of integration as Monte Carlo simulation
draw θ from p(θ|Yobs)
draw Ynew from p(Ynew |Yobs , θ). Under conditional independence this amounts to making independent draws from the data model p(Y |θ) = fY (Y |θ).
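A minimal R sketch of this two-step (composition) sampler, assuming for illustration a Binomial data model and a Beta posterior for θ (the numbers are hypothetical):
nsim <- 5000
a <- 8; b <- 4; N <- 10                        # assumed posterior Beta(a, b); N trials
theta <- rbeta(nsim, a, b)                     # step 1: draw theta from p(theta | Yobs)
Ynew  <- rbinom(nsim, size = N, prob = theta)  # step 2: draw Ynew from p(Ynew | theta)
table(Ynew) / nsim                             # approximate posterior predictive distribution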
Likelihood, prior and Bayesian computation
p(θ|Yobs) = p(Yobs |θ)p(θ) / ∫ p(Yobs |θ)p(θ) dθ (7)
But notice we have been emphasising the numerator of (7),
sometimes called the unnormalised posterior
q(θ|Yobs) = p(Yobs |θ)p(θ).
The Monte Carlo methods we will study generally only need the
unnormalised posterior.
This is a good thing:
1 We specify the prior; we derive the likelihood from the model for the
data that we specify. Hence the unnormalised posterior can always be
written down and computed.
2 In realistic applications the integration in the denominator can be
high-dimensional and difficult.
Introduction to Monte Carlo methods for Bayesian
Computation
Some basic ideas in the use of Monte Carlo methods for
posterior computation (1)
Suppose we can simulate from p(θ|Yobs)
Let h(θ) be some function of θ
We can approximate E (h(θ)|Yobs) = ∫ h(θ) p(θ|Yobs) dθ using the following Monte Carlo (MC) algorithm:
(i) for i in 1 : nsim
    draw θ(i) from p(θ|Yobs)
    compute h(i) = h(θ(i))
    store θ(i), h(i)
(ii) set Ê (h(θ)|Yobs) = (1/nsim) ∑_{i=1}^{nsim} h(i)
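In R the algorithm is only a few lines once posterior draws are available; the Gamma posterior and the choice h(θ) = θ² below are assumed purely for illustration:
nsim  <- 10000
theta <- rgamma(nsim, shape = 12, rate = 4)   # draws from p(theta | Yobs) (assumed Gamma)
h     <- theta^2                              # h(theta)
mean(h)                                       # MC approximation to E(h(theta) | Yobs)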
Basic ideas in Monte Carlo computation (2)
Many things we are interested in are integrals and can be written as
expectations of h(θ) for some choice of h(θ) :
Expectation: h(θ) = θ
E (θ|Yobs) = ∫ θ p(θ|Yobs) dθ
Variance: h(θ) = (θ − E (θ|Yobs))²
V (θ|Yobs) = ∫ (θ − E (θ|Yobs))² p(θ|Yobs) dθ
Tail probability: e.g. h(θ) = Ic(θ), where Ic(θ) = 1 if θ ≤ c; Ic(θ) = 0 if θ > c.
Pr(θ ≤ c|Yobs) = ∫_{−∞}^{c} p(θ|Yobs) dθ = ∫ Ic(θ) p(θ|Yobs) dθ

Basic ideas in Monte Carlo computation (2b)
Interval probability: h(θ) = I(a,b)(θ) = 1 if a ≤ θ ≤ b; I(a,b)(θ) = 0, otherwise.
Pr(a ≤ θ ≤ b|Yobs) = ∫_a^b p(θ|Yobs) dθ (8)
= ∫ I(a,b)(θ) p(θ|Yobs) dθ (9)
So, generate θ = (θ1, . . . , θM) from the posterior, count up the number of θ values falling in (a, b) and divide by M.
As b → a this gets pretty close to density estimation, e.g.
plot(density(rgamma(n=1000, shape=alpha+sumY, rate=beta+sumN)))
The general idea, in practice, is to draw values of θ from p(θ|Yobs) and approximate features of the posterior by the corresponding features of the sample of θ values.

Compare Monte Carlo density estimate with exact density
[Figure: Monte Carlo density estimate overlaid on the exact posterior density]

Basic ideas in Monte Carlo computation (3)
Recall:
p(θ|Yobs) = p(Yobs |θ)p(θ) / ∫ p(Yobs |θ)p(θ) dθ (10)
Setting h(θ) = p(Yobs |θ), i.e. the likelihood, we can see that the normalising constant (denominator) in the posterior also has the form of an expectation, but over the prior. So this is a bit different from the other examples, where integration is over the posterior for θ.
In principle we could apply the standard MC algorithm to approximate K (Yobs) = ∫ p(Yobs |θ)p(θ) dθ by repeatedly sampling values from p(θ), computing p(Yobs |θ) = ∏_{i=1}^{n} p(Yi |θ) for each generated θ, and taking the average.
But ∏_{i=1}^{n} p(Yi |θ) rapidly gets very small, which leads to numerical problems. We need to be a bit more clever to evaluate the normalising constant (fortunately we mostly don't need to do so explicitly).

Aside: Laplace approximation to the normalising constant
p(θ|Yobs) = p(Yobs |θ)p(θ) / ∫ p(Yobs |θ)p(θ) dθ = q(θ|Yobs) / ∫ q(θ|Yobs) dθ (11)
Let u(θ) = log(q(θ|Yobs)) and suppose u(θ) is maximised at θ = θ̂, so θ̂ is the posterior mode. Let u′′(θ) = d²u(θ)/dθ² and let r denote the dimension of θ. Then
K (Yobs) = ∫ q(θ|Yobs) dθ ≈ q(θ̂|Yobs) (2π)^{r/2} |−u′′(θ̂)|^{−1/2} (12)
In practice we would compute
log(K (Yobs)) ≈ u(θ̂) + (r/2) log(2π) − 0.5 log |−u′′(θ̂)| (13)
= ∑_{i=1}^{n} log(p(Yi |θ̂)) + log(p(θ̂)) + (r/2) log(2π) − 0.5 log |−u′′(θ̂)| (14)
We can often then work with log(p(θ|Yobs)) ≈ log(q(θ|Yobs)) − log(K̂ (Yobs)), where log(K̂ (Yobs)) is the rhs of (14).

Basic ideas of Monte Carlo computation (4): Work on the log scale for intermediate calculations
To avoid numerical problems we will usually work on the loge scale when evaluating likelihoods and posterior distributions, leaving exponentiation as late as possible.
For example, in the Metropolis-Hastings algorithm, to be considered later, it is necessary to evaluate a ratio of posterior densities computed at different values of θ. Instead of working with rMH = p(θ1|Y)/p(θ2|Y) directly, we work with
log(rMH) = log(p(θ1|Y)) − log(p(θ2|Y))
e.g. instead of determining whether rMH > 1 we determine whether log(rMH) > 0.
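A minimal sketch of the log-scale comparison in R (the normal data model, normal prior and the two θ values are assumptions made only so the sketch runs):
Yobs <- rnorm(20, mean = 2)                 # hypothetical data
log_post <- function(theta) {               # log unnormalised posterior log q(theta | Yobs)
  sum(dnorm(Yobs, mean = theta, sd = 1, log = TRUE)) +  # assumed normal likelihood
    dnorm(theta, mean = 0, sd = 10, log = TRUE)         # assumed normal prior
}
log_rMH <- log_post(2.1) - log_post(1.5)
log_rMH > 0                                 # equivalent to rMH > 1, without exponentiating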
Basic ideas in Monte Carlo computation (5): Monte Carlo
error
Yes, Monte Carlo methods provide approximations to posterior
distributions.
Monte Carlo error is under the control of the analyst – reduce error
by increasing the Monte Carlo sample size!
We can get a sense of the Monte Carlo error.
For simple Monte Carlo methods this is fairly straightforward, at
least for the expected value (of θ or h(θ) ).
if θnsim = (θ(1), θ(2), . . . , θ(nsim)) are nsim independent draws from the posterior and s(θnsim) is the standard deviation of the draws, then s(θnsim)/√nsim is a reasonable approximation to the Monte Carlo error for the expectation.
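A quick R sketch (the Gamma posterior is again an assumed stand-in for p(θ|Yobs)):
nsim  <- 10000
theta <- rgamma(nsim, shape = 12, rate = 4)   # independent posterior draws (assumed)
mean(theta)                                   # MC estimate of E(theta | Yobs)
sd(theta) / sqrt(nsim)                        # approximate Monte Carlo error of that estimate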
Basic ideas in Monte Carlo computation (6): More on
Monte Carlo error
s(θnsim) is an approximation to the posterior standard deviation, so V (θnsim) = s(θnsim)² is an approximation to the posterior variance.
So, in a loose sense, we can think of
Vtot(θnsim) = V (θnsim) + V (θnsim)/nsim (15)
= V (θnsim)(1 + 1/nsim) (16)
as a measure of total uncertainty about θ
For nsim = 100, total uncertainty (variance) is about 1% greater than posterior uncertainty; the total standard deviation is greater than the posterior standard deviation by a factor of √(1 + 1/100) ≈ 1.005, i.e. about 0.5%.
Monte Carlo error can often be a small contributor to total uncertainty.
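A quick numerical check of the nsim = 100 example:
nsim <- 100
1 + 1 / nsim              # variance inflation factor: 1.01, i.e. about 1%
sqrt(1 + 1 / nsim) - 1    # standard-deviation inflation: about 0.005, i.e. about 0.5%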
Basic ideas in Monte Carlo computation (7): Some caveats
Caveat 1: approximating extreme posterior quantiles requires a bigger Monte Carlo sample size than approximating the posterior expectation.
Caveat 2: The above arguments are based on the premise that we
can draw nsim values independently from the posterior. The more
complex Monte Carlo methods we will look at later generate
correlated draws and approximation of the Monte Carlo standard
error in this case is more difficult. For a given nsim, the Monte Carlo
standard error will be greater for correlated draws than for
independent draws.
Applications of direct Monte Carlo simulation
Direct simulation of the posterior of a function of random
variables Ex (1)
Consider a simple two-group study: people randomly allocated to group
A or B
Group A – 10 people given drug A; 7 successes (respond to
treatment)
Group B – 10 people given drug B; 3 successes
A model for these data:
Conditional Independence:
p(YA,YB |θA, θB ,NA,NB) = p(YA|θA,NA)p(YB |θB ,NB)
YA ∼ Binomial(θA,NA)
YB ∼ Binomial(θB ,NB)
Suppose p(θA, θB) = p(θA)p(θB) and θA ∼ Beta(1, 1), θB ∼ Beta(1, 1)
Suppose further that we are interested in the relative risk, rr = θA/θB .
Simulating the posterior of a function of random variables
(2)
Ultimately we want p(rr|data), where here data is (YA,YB) and we regard NA,NB as known constants, fixed by the investigator.
The first step is to obtain the posterior for the model parameters:
p(θA, θB |data) ∝ p(YA,YB |θA, θB ,NA,NB) p(θA) p(θB)
= [p(YA|θA,NA) p(θA)] × [p(YB |θB ,NB) p(θB)]
= [Binomial(YA|θA,NA) Beta(θA|1, 1)] × [Binomial(YB |θB ,NB) Beta(θB |1, 1)]
= [Beta(θA|YA + 1,NA − YA + 1)] × [Beta(θB |YB + 1,NB − YB + 1)]
Simulating the posterior of a function of random variables
(3)
p(θA, θB |data) = [Beta(θA|YA + 1,NA − YA + 1)] × [Beta(θB |YB + 1,NB − YB + 1)] (17)
It is easy to simulate the joint posterior for (θA, θB). Just draw values
independently from the appropriate Beta distributions. Almost as easy to
simulate the posterior of rr = θA/θB :
for (i in 1:nsim) {
    draw θA(i) from Beta(YA + 1,NA − YA + 1)
    draw θB(i) from Beta(YB + 1,NB − YB + 1)
    set rr(i) = θA(i)/θB(i)
    store rr(i)
}
Summarise stored rr values (see R script posterior_rr.r)
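A sketch along these lines (the actual posterior_rr.r script is referenced above and not reproduced here; this is just one way such a script could look):
YA <- 7; nA <- 10; YB <- 3; nB <- 10          # data from the two-group example
nsim    <- 10000
theta_A <- rbeta(nsim, YA + 1, nA - YA + 1)   # draws from p(theta_A | data)
theta_B <- rbeta(nsim, YB + 1, nB - YB + 1)   # draws from p(theta_B | data)
rr      <- theta_A / theta_B
quantile(rr, c(0.025, 0.5, 0.975))            # posterior summary for the relative risk
mean(rr > 1)                                  # Pr(theta_A > theta_B | data)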
Direct simulation of function of parameters: Ex. 2 Cancer
rates by DHB
see R markdown file cancer_example.pdf and /or
cancer_example.rmd
If λi is the underlying rate of tongue cancer for the i-th DHB we can, using Monte Carlo simulation, compute quantities like Pr(λi > λj , ∀ j ≠ i |Y), i.e. the posterior probability that the i-th DHB has the largest underlying cancer rate among all DHBs. We can compute this probability for each DHB.
We can also compute the posterior distribution for the rank order of DHBs with respect to tongue cancer rates.
These types of inferences are difficult to obtain analytically.
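A rough R sketch of the "largest rate" and rank-order calculations; it assumes independent Gamma posteriors for the λi, and the shape/rate values below are placeholders rather than the numbers used in cancer_example.rmd:
nDHB  <- 5
shape <- c(12, 9, 15, 7, 11); rate <- c(100, 80, 120, 60, 90)          # placeholder posteriors
nsim  <- 10000
lambda <- sapply(1:nDHB, function(i) rgamma(nsim, shape[i], rate[i]))  # nsim x nDHB draws
table(factor(apply(lambda, 1, which.max), levels = 1:nDHB)) / nsim     # Pr(DHB i has largest rate)
ranks <- t(apply(-lambda, 1, rank))           # rank 1 = highest rate, within each posterior draw
table(ranks[, 1]) / nsim                      # posterior distribution of DHB 1's rank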
Useful probability decompositions for Monte Carlo
simulation (and Bayesian Inference in general)
Direct simulation for a vector of random variables
We can always write a joint distribution as a product of a marginal and
conditional distribution, e.g.
p(Y1,Y2) = p(Y1)p(Y2|Y1)
Extends automatically to vectors of arbitrary length:
p(Y1,Y2, . . . ,Yk) = p(Y1) p(Y2|Y1) p(Y3|Y2,Y1) · · · p(Yk |Yk−1, . . . ,Y1)
And, of course, this holds for any vector of r.v.'s, not just observables, so we might sometimes model a parameter vector as
p(θ1, θ2, . . . , θk) = p(θ1) p(θ2|θ1) p(θ3|θ2, θ1) · · · p(θk |θk−1, . . . , θ1)
The point is we are breaking a multivariate distribution into a sequence
of univariate (more generally, low-dimensional) distributions – often easier
to model and simulate. If you are lucky you can take advantage of
conditional independencies to simplify further.
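A minimal R sketch of simulating a joint distribution by this marginal-then-conditional route (the bivariate normal setup is assumed purely for illustration):
nsim <- 10000
Y1 <- rnorm(nsim, mean = 0, sd = 1)            # draw from the marginal p(Y1)
Y2 <- rnorm(nsim, mean = 0.8 * Y1, sd = 0.6)   # draw from the conditional p(Y2 | Y1)
cor(Y1, Y2)                                    # the induced joint dependence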
Sequence of conditionals approach also applies to a
combination of parameters and observables; and holds
conditionally
p(θ,Yn+1|Y1, . . . ,Yn) = p(θ|Y1, . . . ,Yn) p(Yn+1|θ,Y1, . . . ,Yn) (18)
If Y1, . . . ,Yn are the observed data we will often write expressions such as (18) as
p(θ,Yn+1|data) = p(θ|data)p(Yn+1|θ,data). (19)
If the Y ′s are modelled as conditionally independent given θ :
p(Y1,Y2, . . . |θ) = ∏_i p(Yi |θ) (20)
(19) simplifies to
p(θ,Yn+1|data) = p(θ|data)p(Yn+1|θ) (21)