19th November 2019
Bayesian methods
https://bitbucket.org/mfumagal/statistical_inference
Matteo Fumagalli
Intended Learning Outcomes
At the end of this session you will be able to:
critically discuss advantages (and disadvantages) of Bayesian data analysis,
illustrate Bayes’ Theorem and concepts of prior and posterior distributions,
implement simple Bayesian methods in R, including sampling and approximated techniques,
apply Bayesian methods to solve problems in biology.
Intended Learning Outcomes
At the end of this day you will be able to:
appreciate the use of Bayesian statistics in life sciences,
formulate and explain Bayes’ theorem.
describe a Normal-Normal model and implement it in R with or without Monte Carlo sampling,
apply Bayesian statistics to estimate genotypes from DNA sequencing data.
Meet Nessie
Figure 1: Nessie, the Loch Ness Monster. True or fake news?
Likelihood for a monster to exist (!?)
Let D = {0, 1} be our data: whether I tell you I saw Nessie or not. Let θ = {0, 1} be the parameter: whether Nessie exists (θ = 1) or not (θ = 0).
Questions
What are p(D = 1|θ = 1) and p(D = 1|θ = 0)? What is the Maximum Likelihood Estimate of θ?
Likelihood thinking…
Our inference on θ is driven solely by our observations, given by our likelihood function.
Figure 2: The eye: a ”likelihood” organ.
Non-likelihood thinking…
In real life we take many decisions based not only on what we observe but also on some beliefs of ours.
Figure 3: The brain: a ”non-likelihood” organ.
Bayesian thinking
With ”eyes only”, our intuition is that p(θ|D) ∝ p(D|θ).
With ”the brain”, our intuition is that p(θ|D) ∝ p(D|θ)p(θ).
Our ”belief” expresses the probability p(θ) unconditional of the data.
Question
How can we define p(θ)?
Prior and posterior probability
The ”belief” function p(θ) is called the prior probability, and the product of the likelihood p(D|θ) and the prior is proportional to the posterior probability p(θ|D).
The use of posterior probabilities for inferences is called Bayesian statistics.
Bayesian statistics
Bayesian statistics is an alternative to frequentist approaches but without a definite division as in many cases the approach taken is eclectic.
Figure 4: Ronald Fisher.
Publishing a paper
Example:
You submitted a manuscript for publication to a peer-reviewed journal and you want to assess its probability of being accepted and published.
What information do you need (and use) to make such an inference?
Measuring biodiversity
Example:
You are measuring the biodiversity of crabs on Scottish rock shores in four different locations over three years.
Table 1: Biodiversity levels.
Year   Loc. A   Loc. B   Loc. C   Loc. D
2016     45       54       47       52
2017     41        ?       43       45
2018     32       38       37       35
Question
What is a reasonable value for the missing entry?
Statistical inference
Frequentist (from data only)
Likelihoodist (using a statistical model)
Bayesian (using prior information)
Empirical Bayesian (observed data contribute to the prior)
Figure 5: The brain and the eye: an Empirical Bayesian organ.
Statistical inference
If D is the data and θ is your unknown parameter, then
the frequentist conditions on parameters and integrates over
the data, p(D|θ),
the Bayesian conditions on the data and integrates over the parameters, p(θ|D).
Bayesian vs. Likelihoodist:
we derive ”proper” probability distributions of our parameters
rather than deriving a point estimate,
a probability is assigned to a hypothesis rather than a hypothesis is tested,
we ”accept” the null hypothesis rather than ”fail to reject” it,
parsimony imposed in model choice rather than correcting for multiple tests.
History
Figure 6: Rev. Thomas Bayes Figure 7: Pierre-Simon, marquis de Laplace
Why?
Why is Bayesian statistics becoming so commonly used?
recent increased computing power
good frequentist properties
answers are easily interpretable by non-specialists
already implemented in packages
Troubles with the p-value? John K. Kruschke (2010)
…the fundamental fatal flaw of p-values is that they are ill defined, because any set of data has many different p-values.
…many people mistake the p-value for the probability that the null hypothesis is true.
Troubles with the prior?
John K. Kruschke (2010)
Some people may have the mistaken impression that the advantages of Bayesian methods are negated by the need to specify a prior distribution.
It is inappropriate not to use a prior.
Priors are explicitly specified and must be agreeable to a skeptical scientific audience.
When different groups of scientists have differing priors, stemming from differing theories and empirical evidence, then Bayesian methods provide rational means for comparing the conclusions from the different priors.
Why are WE using it?
Bayesian statistics is widely used in many topics in life sciences:
genetics (e.g. fine mapping of disease-susceptibility genes)
ecology (e.g. agent-based models)
evolution (e.g. inference of phylogenetic trees)
bioinformatics (e.g. genome assembly)
systems biology (e.g. gene networks)
…
and provides a rationale for other approaches (e.g. artificial neural networks).
The likelihood approach
Y is a random variable
f(y|θ) is a probability distribution (called the likelihood) representing the sampling model for the observed data (y1, y2, …, yn) given a vector of unknown parameters θ
∫ f(y|θ)dθ is not necessarily equal to 1, or even finite
It is possible to find the value of θ that maximises the likelihood function: we can calculate a maximum likelihood estimate (MLE) for θ as θ̂ = argmax_θ f(y|θ)
Bayes’ Theorem
p(θ|y) = f(y|θ)π(θ) / m(y) = f(y|θ)π(θ) / ∫ f(y|θ)π(θ)dθ (1)
θ is not a fixed parameter but a random quantity with prior distribution π(θ)
p(θ|y) is the posterior probability distribution of θ
∫ p(θ|y)dθ = 1
Bayes’ Theorem in action!
Figure 8: The chytrid fungus Batrachochytrium dendrobatidis is the most significant threat to amphibians.
Let’s prove Bayes’ Theorem!
Venn diagrams
Figure 9: Sets U, A (samples with infection), B (samples with positive test result) and A ∩ B (or AB).
Given that the test is positive for a randomly selected sample, what is the probability that said sample is infected?
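As a minimal numerical sketch of this question (the prevalence, sensitivity and specificity below are invented values for illustration, not taken from the slides), Bayes’ theorem can be evaluated directly in R:

prevalence  <- 0.10   # P(A): a sample is infected (assumed value)
sensitivity <- 0.95   # P(B|A): positive test given infection (assumed value)
specificity <- 0.90   # P(not B | not A) (assumed value)
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)  # P(B)
sensitivity * prevalence / p_positive   # P(A|B), about 0.51 with these values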
Normal-Normal model
If
f(y|θ) = N(y|θ, σ²) (2)
π(θ) = N(θ|μ, τ²) (3)
then
p(θ|y) = N(θ | (σ²μ + τ²y)/(σ² + τ²), σ²τ²/(σ² + τ²)) (4)
Normal-Normal model
”Shrinking” factor B:
B = σ²/(σ² + τ²) (5)
with 0 ≤ B ≤ 1. Then
E(θ|y) = Bμ + (1 − B)y (6)
Var(θ|y) = (1 − B)σ² ≡ Bτ² (7)
Normal-Normal model
Question:
What if σ² >> τ²? What if σ² << τ²?
Save the frogs with a Normal-Normal model!
Infected frogs (observed and ”believed” a priori). Assume that y = 6, σ = 1, μ = 2 and τ = 1.
f(y = 6|θ) = N(y = 6|θ, 1) (8)
π(θ) = N(θ|2, 1) (9)
Exercise:
1 Open R and calculate and plot the prior, likelihood, and posterior distribution.
2 Calculate the maximum a posteriori probability (MAP).
3 What happens if we use a sharper (narrower) or wider prior?
4 What happens if we have more observations?
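A possible solution sketch in R, using the Normal-Normal formulas above with y = 6, σ = 1, μ = 2 and τ = 1 (so B = 0.5 and the posterior is N(4, 0.5)):

theta <- seq(-2, 10, by = 0.01)
prior      <- dnorm(theta, mean = 2, sd = 1)
likelihood <- dnorm(6, mean = theta, sd = 1)
post_mean  <- 0.5 * 2 + 0.5 * 6                 # B*mu + (1 - B)*y = 4
post_var   <- 0.5 * 1                           # (1 - B)*sigma^2 = 0.5
posterior  <- dnorm(theta, mean = post_mean, sd = sqrt(post_var))
plot(theta, posterior, type = "l", col = "red", ylab = "density")
lines(theta, prior, col = "blue")
lines(theta, likelihood, col = "darkgreen")
theta[which.max(posterior)]                     # MAP estimate (4 here)

A sharper prior pulls the posterior towards μ, while more observations pull it towards the data.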
Monte Carlo sampling
Drawing random samples from the posterior distribution instead of calculating its parameters.
Figure 10: Monte Carlo and its famous casino.
Exercise:
1 Open R and plot the posterior distribution of the previous example using Monte Carlo sampling.
2 What happens if we use more or fewer samples?
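One possible Monte Carlo sketch for the same frog example, drawing samples from the known N(4, 0.5) posterior instead of evaluating its density on a grid:

set.seed(42)
samples <- rnorm(1e5, mean = 4, sd = sqrt(0.5))        # Monte Carlo draws from the posterior
hist(samples, breaks = 50, freq = FALSE, main = "Monte Carlo posterior")
curve(dnorm(x, mean = 4, sd = sqrt(0.5)), add = TRUE, col = "red")
# with fewer samples (e.g. 100) the histogram is a much noisier approximation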
Recap & Refresh
Why going Bayesian? p-values are troublesome.
Bayes’ theorem: p(θ|y) = f(y|θ)π(θ) / ∫ f(y|θ)π(θ)dθ
Normal-Normal model: p(θ|y) = N(θ | (σ²μ + τ²y)/(σ² + τ²), σ²τ²/(σ² + τ²))
Monte Carlo sampling
Intended Learning Outcomes
At the end of this day you will be able to:
describe the pros and cons of using different priors (e.g. elicited, conjugate, ...);
evaluate the interplay between prior and posterior distributions using R,
calculate several quantities of interest from posterior distributions,
apply Bayesian inference to estimate population variation from DNA data.
Prior distributions
How can we decide which prior distribution is more appropriate in our study?
They are derived from past information or personal opinions from experts.
They are typically distributed as commonly used distribution families.
They can be constructed so as to bear little information.
...endless possibilities...
Elicited priors
Define the collection of θ which are possible, assign some probability to each one of these cases, make sure that they sum up to 1.
Elicited discrete priors
Figure 11: How many kits do rabbits have in one litter?
Knowing that ”rabbits can have anywhere from one to 14 babies in one litter with an average litter size of 6”, let’s build a prior distribution.
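One way to encode this information as a discrete prior in R; the truncated Poisson shape is an assumption made only for illustration, since any distribution over 1 to 14 with mean close to 6 that sums to 1 would be acceptable:

litter <- 1:14
prior <- dpois(litter, lambda = 6)             # Poisson(6) restricted to 1..14
prior <- prior / sum(prior)                    # renormalise so the probabilities sum to 1
barplot(prior, names.arg = litter, xlab = "kits per litter", ylab = "prior probability")
sum(prior * litter)                            # prior mean, close to 6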
Elicited continuous priors
Figure 12: Bumpass Hell, hot springs and fumaroles at Lassen Volcanic National Park, California.
Knowing that ”from past observations, the temperature has a range of (80.1, 110.4) with an average of 88.3 Celsius degrees”, let’s build a prior distribution.
Elicited parametric priors
θ belongs to a parametric distributional family π(θ|ν).
Advantages:
reduces the effort to the elicitee,
overcomes the finite support problem,
may lead to simplifications in the computation of the posterior.
Disadvantage:
impossible to find a distribution that perfectly matches the elicitee’s beliefs.
Elicited parametric priors
π(θ) = { 0 for θ < 80.1 or θ > 110.4; N(μ, σ²) for 80.1 ≤ θ ≤ 110.4 } (10)
with μ = 88.3 and σ² = 10
Figure 13: Elicited prior distribution of water temperature.
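A short R sketch of the prior in equation (10), evaluated on a grid and renormalised over the allowed range:

theta <- seq(75, 115, by = 0.1)
prior <- dnorm(theta, mean = 88.3, sd = sqrt(10))
prior[theta < 80.1 | theta > 110.4] <- 0       # zero outside the elicited range
prior <- prior / sum(prior * 0.1)              # renormalise over the grid
plot(theta, prior, type = "l", xlab = "temperature (Celsius)", ylab = "prior density")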
How to build elicited priors
Focus on quantiles close to the middle of the distribution (e.g. the 50th, 25th and 75th) rather than extreme quantiles (e.g. the 95th and 5th),
assess the symmetry,
priors can be updated and reassessed as new information is available,
useful for experimental design and data exploration.
Conjugate priors
π(θ) is a member of a family which is conjugate with the likelihood f(y|θ), so that the posterior distribution p(θ|y) belongs to the same distributional family as the prior.
Conjugate priors
Example: Y is the count of distinct elephant herds arriving at the pool in a day during the migration season.
Figure 14: Elephants drinking at the pool. What’s the arrival rate for distinct herds?
Poisson and elephants
Poisson distribution is an appropriate model for Y if:
1 Y is the number of times an event occurs in an interval and it can take any non-negative integer value (including 0);
2 the occurrence of one event does not affect the probability that a second event will occur (i.e. events occur independently);
3 the rate at which events occur is constant (it cannot be higher in some intervals and lower in other intervals);
4 two events cannot occur at exactly the same instant;
5 the probability of an event in an interval is proportional to the length of the interval.
Poisson distribution
If θ is the event rate (rate parameter), then the probability of observing y events in an interval is:
f(y|θ) = e^(−θ) θ^y / y!, y ∈ {0, 1, 2, …}, θ > 0 (11)
which is the probability mass function (pmf) for a Poisson distribution.
Let’s plot the distribution for θ = 4 using R.
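A minimal sketch of this plot in R, using the pmf in equation (11):

y <- 0:15
plot(y, dpois(y, lambda = 4), type = "h", xlab = "herds per day", ylab = "probability")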
Poisson distribution
Figure 15: Poisson distribution for θ = 4. This is the likelihood distribution for the number of herds per day with a rate of 4.
Conjugate prior for Poisson distribution
Prior: Gamma distribution
π(θ) = θ^(α−1) e^(−θ/β) / (Γ(α) β^α), θ > 0, α > 0, β > 0 (12)
E[G(α, β)] = αβ (13)
Var[G(α, β)] = αβ² (14)
Gamma distribution
Figure 16: Gamma distribution for different values of shape and rate parameters.
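A sketch reproducing this kind of plot in R; the shape/scale combinations are arbitrary choices for illustration:

theta <- seq(0.01, 15, by = 0.01)
plot(theta, dgamma(theta, shape = 1, scale = 1), type = "l", ylab = "density")
lines(theta, dgamma(theta, shape = 2, scale = 1), col = "red")
lines(theta, dgamma(theta, shape = 3, scale = 2), col = "blue")
legend("topright", legend = c("G(1,1)", "G(2,1)", "G(3,2)"),
       col = c("black", "red", "blue"), lty = 1)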
Gamma + Poisson = ?
p(θ|y) ∝ f(y|θ)π(θ) (15)
∝ ? (16)
Gamma + Poisson = Gamma’
p(θ|y) ∝ f(y|θ)π(θ) (17)
∝ (e^(−θ) θ^y)(θ^(α−1) e^(−θ/β)) (18)
= θ^(y+α−1) e^(−θ(1+1/β)) (19)
p(θ|y) ∼ G(α′, β′) with α′ = y + α and β′ = (1 + 1/β)^(−1)
Posterior is (another) Gamma distribution.
Conjugate priors allow for posterior distributions to emerge without numerical integration!
Elephants’ arrivals
Example:
we have some intuition that we expect to see 3 herds per day (prior)
we observed 4 herds (data)
Exercise:
What is the posterior distribution of θ, the average number of herds (Gamma-Poisson model)? Calculate and plot the distribution in R (use Monte Carlo sampling too).
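A possible solution sketch. The slide only states a prior expectation of 3 herds per day, so the choice of G(α = 3, β = 1) (prior mean αβ = 3) is an assumption; with y = 4 the conjugate update then gives G(7, 0.5):

alpha <- 3; beta <- 1; y <- 4
alpha_post <- y + alpha                        # 7
beta_post  <- 1 / (1 + 1 / beta)               # 0.5
theta <- seq(0, 15, by = 0.01)
plot(theta, dgamma(theta, shape = alpha_post, scale = beta_post),
     type = "l", ylab = "posterior density")
samples <- rgamma(1e5, shape = alpha_post, scale = beta_post)   # Monte Carlo version
lines(density(samples), col = "red", lty = 2)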
Hierarchical modelling
Hyperpriors define the density distribution of hyperparameters.
p(θ|y) = ∫ f(y|θ)π(θ|ν)h(ν)dν / ∫∫ f(y|θ)π(θ|ν)h(ν)dν dθ (20)
Empirical Bayesian
Estimated posterior p(θ|y,νˆ) by replacing ν with an estimate νˆ obtained by maximising the marginal distribution m(y|ν).
If ν ∼ h(ν|λ) with unknown parameters λ, then we have a third-stage prior g(λ).
Non-informative priors
Can we use a Bayesian approach when no reliable prior information on θ is available?
Yes, a noninformative prior distribution for θ contains ”no information” about θ and all the information in the posterior will (mostly) arise from the data.
Noninformative priors
Discrete case:
If Θ = {θ1, θ2, …, θn}, then
p(θi) = 1/n, i = 1, 2, …, n
with
Σ_{i=1}^{n} (1/n) = 1
Noninformative priors
Continuous and bounded case: if Θ = [a, b] with −∞ < a < b < +∞, then p(θ) = 1/(b − a) is a proper (uniform) prior.
Continuous and unbounded case: if Θ = (−∞, +∞), then p(θ) = c for any c > 0 is an improper distribution as
∫_{−∞}^{+∞} p(θ)dθ = +∞
Bayesian inference is still possible under some circumstances.
Noninformative priors
Figure 17: A uniform prior distribution for the arrival rate of elephant herds.
Rule out scenarios that are impossible in real life.
Noninformative priors may lack a conjugate model (sampling methods are required).
Non-informative priors are related to reference priors.
Bayesian inference
The posterior distribution of model parameters can be difficult to interpret.
We want to summarise the information enclosed in these distributions.
Point estimation
Example: Arrival rate with y = 1 and prior ∼ G (0.5, 1).
Question:
What is the (i) mean, (ii) mode and (iii) median of the resulting posterior distribution?
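A sketch of the calculation in R: by the Gamma-Poisson update, y = 1 and a G(0.5, 1) prior give a G(shape = 1.5, scale = 0.5) posterior:

shape <- 1 + 0.5
scale <- 1 / (1 + 1 / 1)
post_mean   <- shape * scale                   # 0.75
post_mode   <- (shape - 1) * scale             # 0.25 (the mode exists since shape > 1)
post_median <- qgamma(0.5, shape = shape, scale = scale)
c(mean = post_mean, mode = post_mode, median = post_median)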
Point estimation
The mode is the easiest to calculate as we can work directly with the numerator.
If the prior distribution is flat then the posterior mode will be equal to the maximum likelihood estimate.
If the posterior distribution is symmetric, then the mean and the median are equivalent.
For symmetric unimodal distributions, all these three features are equivalent.
For asymmetric distributions, the median is often the best choice as it is less affected by outliers and it is intermediate between the mode and the mean.
Point estimation
If we want to obtain a measure of accuracy of a point estimate θ̂(y), we can calculate the posterior variance:
E_{θ|y}[(θ − θ̂)²] (21)
In the multivariate case the posterior mode is θ̂(y) = (θ̂1, θ̂2, …, θ̂k).
Credible intervals
A 100×(1−α) credible set for θ is a subset C of Θ such that:
1 − α ≤ P(C|y) = ∫_C p(θ|y)dθ (22)
”The probability that θ lies in C given the observed data y is at least (1 − α)”
e.g. α = 0.05
Credible intervals
In continuous settings we can calculate the highest posterior density, or HPD, credible set, defined as:
C = {θ ∈ Θ : p(θ|y) ≥ k(α)} (23)
where k(α) is the largest constant satisfying P(C|y) ≥ (1 − α).
Example: p(θ|y) ∼ G(2,1) and k(α) = 0.1.
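A grid-based sketch of this HPD set in R, keeping all values of θ whose posterior density is at least k(α) = 0.1:

theta <- seq(0, 10, by = 0.001)
dens <- dgamma(theta, shape = 2, scale = 1)    # p(theta|y) ~ G(2, 1)
hpd <- range(theta[dens >= 0.1])               # interval where the density exceeds k(alpha)
hpd
sum(dens[dens >= 0.1]) * 0.001                 # posterior mass contained in the HPD set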
Credible intervals
How can we summarise our results?
the posterior mean
several posterior percentiles (e.g. 0.025, 0.25, 0.50, 0.75, 0.975)
a credible interval
posterior probabilities p(θ > c|y) where c is a notable point (e.g. 0, 1, depending on the problem)
a plot of the distribution to check whether it is unimodal, multimodal, skewed, …
Hypothesis testing
In the frequentist approach,
1 one formulates a null hypothesis H0 and an alternative
hypothesis Ha,
2 an appropriate test statistic T(Y) is chosen,
3 one computes the observed significance, or p-value, of the test as the chance that T(Y) is ”more extreme” than T(y_obs), where ”extremeness” is measured towards the alternative hypothesis,
4 if the p-value is less than some threshold, typically in the form of a pre-specified Type I error rate, H0 is rejected, otherwise it is not.
Hypothesis testing
Limits of frequentist approach:
1 tests are defined only when two hypotheses are nested (e.g. H0 is a simplification of Ha and involves setting one parameter of Ha to some known constant value)
2 they only provide evidence against the null hypothesis (e.g. a large p-value does not mean that the two models are equivalent, but only that we lack evidence to the contrary; we don’t ”accept the null hypothesis” but ”fail to reject it”)
3 p-values have no direct interpretation as weight of evidence (but only as a long-term probability; p-values are not the probability that H0 is true!)
Hypothesis testing
In the Bayesian approach,
1 one can test as many models as desired, Mi , i = 1, …, m,
2 one calculates the posterior probability that each model is correct
3 one compares each pair of posterior probabilities.
Hypothesis testing
Suppose we have two models M1 and M2 for data Y and the two models have parameters θ1 and θ2.
With prior densities πi (θi ) and i = 1, 2, the marginal distributions of Y are:
p(y|Mi) = ∫ f(y|θi, Mi) πi(θi) dθi (24)
We can calculate the posterior probabilities P(M1|y) and P(M2|y) = 1 − P(M1|y) for the two models.
Bayes factors
A Bayes factor (BF) is used to summarise these results, and it is equal to the ratio of the posterior odds of M1 to the prior odds of M1:
BF = [P(M1|y)/P(M2|y)] / [P(M1)/P(M2)] = p(y|M1)/p(y|M2) (25)
If the two models are a priori equally probable then:
BF = p(M1|y)/p(M2|y) (26)
which is the posterior odds of M1.
Bayes factors
Interpretation
BF captures the change in the odds in favour of model 1 (vs. 2) as we move from the prior to the posterior.
BF          Strength of evidence
1 to 3      not worth more than a bare mention
3 to 20     positive
20 to 150   strong
> 150       very strong
Recap & Refresh
How to build elicited priors
Conjugate priors: Normal-Normal, Poisson-Gamma, Beta-Binomial
Point estimates (mean, mode, median) and HPD credible intervals
Model testing
Intended Learning Outcomes
At the end of this day you will be able to:
describe the use of asymptotic methods,
illustrate the utility of direct and indirect sampling methods,
evaluate the feasibility of Markov Chain Monte Carlo sampling,
implement simple indirect sampling methods in R.
Bayesian computation
The calculation of posterior distributions often involves the evaluation of complex high-dimensional integrals.
When a conjugate prior is not available or appropriate we can evaluate the posterior distribution with:
1 asymptotic methods for approximating the posterior density;
2 numerical integration.
Asymptotic methods
Bayesian Central Limit Theorem
When there are many data points, p(θ|x) will be approximately normally distributed.
For large sample sizes, the posterior can be approximated by a normal distribution with mean equal to the posterior mode and (co)variance (matrix) equal to minus the inverse of the second derivative matrix of the log posterior evaluated at the mode.
Asymptotic methods
Example:
Recalling the beta-binomial model with flat prior, p(θ|x) ∝ θ^x (1 − θ)^(n−x).
The approximation is given by:
1 take the log: l(θ) = x log θ + (n − x) log(1 − θ)
2 take the derivative of l(θ) and set it to zero, obtaining θ̂_π = x/n
3 take the second derivative evaluated at θ̂, obtaining −n/θ̂ − n/(1 − θ̂)
4 take the minus inverse, obtaining θ̂(1 − θ̂)/n
5 p(θ|x) ≈ N(θ̂_π, θ̂(1 − θ̂)/n)
Asymptotic methods
If p(θ|x) ∝ θ^x (1 − θ)^(n−x), then for large n we have p(θ|x) ≈ N(θ̂_π, θ̂(1 − θ̂)/n).
Exercise:
k = 20
n = 100
π(θ) = G(1,1)
Compare the exact and approximated posterior (e.g. use qqplot). What happens if n = 10?
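A possible sketch of this comparison. For simplicity a flat Beta(1, 1) prior is assumed here, so that the exact posterior is Beta(k + 1, n − k + 1); with the G(1, 1) prior stated in the exercise the exact posterior would need to be evaluated numerically:

k <- 20; n <- 100
theta_hat <- k / n
exact  <- rbeta(1e5, shape1 = k + 1, shape2 = n - k + 1)
approx <- rnorm(1e5, mean = theta_hat, sd = sqrt(theta_hat * (1 - theta_hat) / n))
qqplot(exact, approx); abline(0, 1, col = "red")
# with n = 10 the points deviate more clearly from the diagonal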
Asymptotic methods
Model approximations, or first order approximations: they estimate θ by the mode, and the error goes to 0 at a rate proportional to 1/n.
The estimates of moments and quantiles may be poor if the posterior differs from normality.
Laplace’s method provides a second order approximation to the posterior mean, with an error that decreases at a rate of 1/n².
Asymptotic methods
Advantages:
they replace numerical integration with numerical differentiation,
they are deterministic (without elements of stochasticity).
they reduce the computational complexity of any study of robustness (how sensitive are our conclusions to changes in the prior/likelihood?).
Asymptotic methods
Disadvantages:
they require that the posterior is unimodal,
they require that the size of the data is large (how large is ”large enough”?),
for high-dimensional parameters the calculation of Hessian matrices (second derivatives) is hard.
Noniterative Monte Carlo methods
If θ ∼ h(θ) with h(θ) being a posterior distribution, we wish to estimate γ, the posterior mean of c(θ), where
γ ≡ E[c(θ)] = ∫ c(θ)h(θ)dθ.
If θ1, θ2, …, θN are independent and identically distributed (iid) as h(θ), then:
γ̂ = (1/N) Σ_{i=1}^{N} c(θi) (27)
which converges to E[c(θ)] with probability 1 as N → ∞.
The computation of posterior expectations requires only a sample of size N from the posterior distribution.
Noniterative Monte Carlo methods
The variance of γˆ can be estimated from the sample variance of the c(θi) values.
ŝe(γ̂) = sqrt( (1/(N(N−1))) Σ_{i=1}^{N} [c(θi) − γ̂]² ) (28)
The Central Limit Theorem implies that γ̂ ± 2ŝe(γ̂) provides an approximate 95% confidence interval.
N can be chosen as large as necessary to provide a desirable confidence interval.
Noniterative Monte Carlo methods
In the univariate case, a histogram of the sampled θi estimates the posterior itself.
An estimate of p ≡ P{a < c(θ) < b} is given by
p̂ = #{c(θi) ∈ (a, b)} / N (29)
In contrast to asymptotic methods, accuracy improves with N, the Monte Carlo sample size (which we can choose and have control over), rather than with n, the size of the data set (which we may not be able to control).
Noniterative Monte Carlo methods
What happens if we can’t directly sample from this distribution? There are methods for indirect sampling of the posterior distribution: (i) importance sampling, (ii) rejection sampling, (iii) weighted bootstrap.
Rejection sampling
If we identify an envelope function g(θ) and a constant M > 0 such that L(θ)π(θ) < Mg(θ) for all θ, then:
1 Generate θi ∼ g(θ),
2 Generate U ∼ Uniform(0, 1),
3 If MUg(θi) < L(θi)π(θi), accept θi; otherwise reject it.
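A minimal rejection-sampling sketch in R, reusing the Normal-Normal frog example (y = 6, σ = 1, prior N(2, 1)) so the result can be checked against the known N(4, 0.5) posterior; taking g(θ) equal to the prior, M only needs to exceed the maximum of the likelihood:

set.seed(1)
y <- 6; sigma <- 1; mu <- 2; tau <- 1
lik   <- function(theta) dnorm(y, mean = theta, sd = sigma)
prior <- function(theta) dnorm(theta, mean = mu, sd = tau)
M <- dnorm(y, mean = y, sd = sigma)                   # maximum of the likelihood
theta <- rnorm(1e5, mean = mu, sd = tau)              # 1. draws from g = prior
u <- runif(1e5)                                       # 2. uniform draws
accepted <- theta[M * u * prior(theta) < lik(theta) * prior(theta)]   # 3. accept/reject
hist(accepted, breaks = 50, freq = FALSE, main = "Rejection sampling")
curve(dnorm(x, mean = 4, sd = sqrt(0.5)), add = TRUE, col = "red")    # exact posterior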
Gibbs sampler
1 The parameter space must be fully connected, without ”holes”.
2 When θ and ν are highly correlated the chain will have a ”slow mixing”.
3 To ensure that all the full conditional distributions are available, the prior distribution of each parameter can be chosen to be conjugate with the corresponding likelihood.
Gibbs sampler
A histogram of {θi^(t), t = t0 + 1, …, T} provides an estimator of the marginal posterior distribution for θi.
The posterior mean can be estimated as:
Ê(θi|y) = (1/(T − t0)) Σ_{t=t0+1}^{T} θi^(t) (30)
The time 0 ≤ t ≤ t0 is called the burn-in period.
Metropolis algorithm
Given:
p(θ|⃗y ) ∝ h(θ) ≡ f (y |θ)π(θ)
a candidate, or proposal, symmetric density q(θ∗|θ(t−1)) which satisfies q(θ∗|θ(t−1)) = q(θ(t−1)|θ∗),
a starting value θ(0) at iteration t = 0,
for (t = 1, ..., T ) the algorithm repeats:
1 Draw θ∗ ∼ q(·|θ(t−1))
2 Calculate r = h(θ∗)/h(θ(t−1))
3 If r ≥ 1, set θ(t) = θ∗, otherwise set θ(t) = θ∗ with probability
r or set θ(t) = θ(t−1) with probability 1 − r.
θ(t) converges in distribution to a draw from the true posterior density p(θ|y).
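A minimal random-walk Metropolis sketch in R, again using the frog example (y = 6, σ = 1, prior N(2, 1)) so that the chain can be compared with the known N(4, 0.5) posterior:

set.seed(2)
y <- 6; sigma <- 1; mu <- 2; tau <- 1
h <- function(theta) dnorm(y, theta, sigma) * dnorm(theta, mu, tau)   # unnormalised posterior
T_iter <- 1e4
theta <- numeric(T_iter); theta[1] <- 0                               # starting value theta^(0)
for (t in 2:T_iter) {
  proposal <- rnorm(1, mean = theta[t - 1], sd = 1)                   # symmetric candidate q
  r <- h(proposal) / h(theta[t - 1])
  theta[t] <- if (runif(1) < r) proposal else theta[t - 1]            # accept with prob min(r, 1)
}
burnin <- 1000
hist(theta[-(1:burnin)], breaks = 50, freq = FALSE, main = "Metropolis samples")
curve(dnorm(x, mean = 4, sd = sqrt(0.5)), add = TRUE, col = "red")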
Metropolis algorithm
A usual candidate density is:
q(θ∗|θ(t−1)) = N(θ∗|θ(t−1), Σ̃) (31)
random walk Metropolis: symmetric and ”self-correcting” distribution.
Σ̃, the posterior variance, can be empirically estimated from a preliminary run.
Metropolis-Hastings algorithm
When q(θ ∗ |θ(t−1)) ̸= q(θ(t−1)|θ∗) the acceptance rate r is:
r = [h(θ∗) q(θ(t−1)|θ∗)] / [h(θ(t−1)) q(θ∗|θ(t−1))] (32)
A draw θ(t) converges in distribution to a draw from the true posterior density as t → ∞.
Hastings independence chain
If we set q(θ∗|θ(t−1)) = q(θ∗) then the proposal ignores the current value of the variable.
The acceptance rate is:
r = [h(θ∗)/q(θ∗)] / [h(θ(t−1))/q(θ(t−1))] (33)
MCMC algorithms
Langevin-Hastings algorithm introduces a systematic drift in the candidate density.
Slice sampler algorithm uses auxiliary variables to expand the parameter space.
Hybrid forms combine multiple algorithms in a single problem.
Adaptive algorithms use the early output from a chain to refine the sampling as it progresses.
Convergence
Diagnostic strategy:
run parallel chains with starting points from a wide distribution;
visually inspect these chains;
for each graph, calculate the scale reduction factor;
investigate cross-correlations among parameters.
Figure 18: Three chains mixing for increasing t.
Recap & Refresh
If there is no closed form for the posterior distribution:
asymptotic methods (e.g. Normal approximation)
direct sampling (Monte Carlo)
non-iterative sampling (e.g. rejection algorithm)
iterative sampling: MCMC (Gibbs vs. Metropolis algorithms)
Intended Learning Outcomes
At the end of this day you will be able to:
appreciate the applicability of ABC,
describe the rejection algorithm,
critically discuss the choice of summary statistics,
implement ABC methods in R.
Posterior probability distribution
p(θ|x) = p(x|θ)π(θ) / p(x)
which can be difficult to compute, as the marginal likelihood
p(x) = ∫ p(x|θ)π(θ)dθ
might involve a high-dimensional integral that is difficult (or impossible) to solve.
Sampling from the posterior
If the likelihood can be evaluated up to a normalising constant, Monte Carlo methods can be used to sample from the posterior.
If the likelihood function becomes difficult to define and compute, it is easier to simulate data samples from the model given the value of a parameter.
Rejection algorithm
If data points are discrete and of low dimensionality, given observation y, repeat the following until N points have been accepted:
1 Draw θi ∼ π(θ)
2 Simulate xi ∼ p(x|θi)
3 Reject θi if xi ≠ y
The accepted θi are samples from p(θ|y).
Rejection algorithm (elephants are back!)
Example:
We observe 4 herds arriving.
The likelihood is Poisson-distributed and the prior is Gamma-shaped, G(3, 1).
The posterior distribution is Gamma-distributed with shape parameter 3 + 4 = 7 and scale parameter 0.5.
Let’s assume that we can’t evaluate the likelihood but we know how to simulate y given a certain value of our parameter θ.
Exercise:
Calculate the posterior distribution of θ in R.
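A possible solution sketch of the ABC rejection algorithm in R, with the exact G(7, 0.5) posterior overlaid as a check:

set.seed(3)
y_obs <- 4
N <- 1e5
theta <- rgamma(N, shape = 3, scale = 1)       # 1. draw from the prior G(3, 1)
x_sim <- rpois(N, lambda = theta)              # 2. simulate data given theta
posterior <- theta[x_sim == y_obs]             # 3. keep theta_i only if x_i = y
hist(posterior, breaks = 50, freq = FALSE, main = "ABC posterior")
curve(dgamma(x, shape = 7, scale = 0.5), add = TRUE, col = "red")     # exact posterior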
Rejection algorithm
If data points are continuous and of low dimensionality, given observation y, repeat the following until N points have been accepted:
1 Draw θi ∼ π(θ)
2 Simulate xi ∼ p(x|θi)
3 Reject θi if ρ(xi, y) > ε
where ρ(·) is a function measuring the distance between simulated and observed points.
Rejection algorithm (hot water is back!)
Example (water temperature):
θ is continuous with prior distribution U(80.1, 110.3).
We have a single observation y = 91.3514.
as ρ(·) we use the Euclidean distance:
ρ(xi, y) = sqrt((xi − y)²) (34)
We can’t evaluate the likelihood function but we can simulate observations that are distributed according to it.
Exercise:
Calculate the posterior distribution of θ using R.
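A possible solution sketch in R. The simulator and the tolerance ε are assumptions made for illustration (here x ∼ N(θ, sd = sqrt(10)) and ε = 0.5), since the slide only specifies the prior, the observation and the distance:

set.seed(4)
y_obs <- 91.3514
N <- 1e5
theta <- runif(N, min = 80.1, max = 110.3)     # 1. draw from the prior
x_sim <- rnorm(N, mean = theta, sd = sqrt(10)) # 2. simulate an observation (assumed model)
rho <- sqrt((x_sim - y_obs)^2)                 # Euclidean distance, equation (34)
eps <- 0.5                                     # assumed tolerance
posterior <- theta[rho <= eps]                 # 3. keep theta_i if the distance is small
hist(posterior, breaks = 50, freq = FALSE, main = "ABC posterior for theta")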
Rejection algorithm v2
Alternatively, ε is the proportion of accepted simulations (ranked by their distance to the observations). In this case one sets the total number of simulations to perform (not the number of accepted simulations).
Rejection algorithm v2
Exercise (water temperature):
1 Y = {91.34, 89.21, 88.98}
2 θ has prior N(μ = 90, σ² = 20) for 80 ≤ θ ≤ 110
3 the simulating function is simulate <- function(param) rnorm(n=1, mean=param, sd=sqrt(10))
4 the distance function is ρ(xi, Y) = sqrt( Σ_{j∈Y} (xi − j)² / |Y| )
5 N = 10,000 and ε = 0.05
Tasks:
1 plot the sampled prior distribution
2 plot the distribution of ranked distances with indication of the 5% threshold
3 plot the posterior distribution
4 calculate notable quantiles and HPD 95%
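A sketch of this proportion-based version in R; the truncation of the prior to [80, 110] is handled here by simple resampling:

set.seed(5)
Y <- c(91.34, 89.21, 88.98)
N <- 10000; eps <- 0.05
theta <- rnorm(3 * N, mean = 90, sd = sqrt(20))        # prior N(90, 20), oversampled
theta <- theta[theta >= 80 & theta <= 110][1:N]        # keep N draws inside [80, 110]
simulate <- function(param) rnorm(n = 1, mean = param, sd = sqrt(10))
x_sim <- sapply(theta, simulate)
rho <- sapply(x_sim, function(x) sqrt(sum((x - Y)^2) / length(Y)))    # distance to Y
threshold <- quantile(rho, eps)                        # 5% acceptance threshold
posterior <- theta[rho <= threshold]
hist(posterior, breaks = 50, freq = FALSE, main = "ABC posterior (proportion-based)")
quantile(posterior, c(0.025, 0.5, 0.975))              # notable quantiles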
Rejection algorithm with high dimensionality
If data points are of high dimensionality, given observation y, repeat the following until N points have been accepted:
1 Draw θi ∼ π(θ)
2 Simulate xi ∼ p(x|θi)
3 Reject θi if ρ(S(xi), S(y)) > ε
with S(y) being summary statistics.
Rejection algorithm
Figure 19: From Beaumont 2010 Annu Rev Ecol Evol Syst. Rejection- and regression-based approximate Bayesian computation (ABC).
Summary statistics
The choice of summary statistics is a mapping from a high dimension to a low dimension.
Some information is lost, but with enough summary statistics much of the information is kept.
The aim for the summary statistics is to satisfy Bayes’ sufficiency:
p(θ|x) = p(θ|S(x))
Issues? Solutions:
1 use a wider acceptance tolerance
2 perform a better sampling from the prior
Regression-based ABC
1 Given observation y, repeat the following until M points have been generated: A. Draw θi ∼ π(θ); B. Simulate xi ∼ p(x|θi)
2 Calculate Sj(x) for all j and kj
3 ρ(S(x), S(y)) = sqrt( Σ_{j=1}^{s} (Sj(x)/kj − Sj(y)/kj)² )
4 Choose ε such that the proportion of accepted points is Pε = N/M
5 Weight the simulated points S(xi) using Kε(ρ(S(xi), S(y))), where Kε(t) = ε^(−1)(1 − (t/ε)²) for t ≤ ε and Kε(t) = 0 for t > ε
6 Apply weighted linear regression to the N points that have nonzero weight to obtain an estimate of Ê(θ|S(x))
7 Adjust θi∗ = θi − Ê(θ|S(x)) + Ê(θ|S(y))
8 The θi∗ with weights Kε(ρ(S(xi), S(y))) are random draws from an approximation of p(θ|y).
MCMC-ABC
Initialise by sampling θ(0) ∼ π(θ). At iteration t ≥ 1,
1 Simulate θ′ ∼ K(θ|θ(t−1)) where K(·) is a proposal distribution that depends on the current value of θ
2 Simulate x ∼ p(x|θ′).
3 If ρ(S(x), S(y)) < ε (rejection step), draw u ∼ U(0, 1):
if u ≤ π(θ′)/π(θ(t−1)) × K(θ(t−1)|θ′)/K(θ′|θ(t−1)), update θ(t) = θ′;
otherwise θ(t) = θ(t−1);
4 otherwise θ(t) = θ(t−1).
Model assessment
Model choice
Given a series of models μ1, μ2, ..., μN with prior probabilities Σi π(μi) = 1, it is of interest to calculate Bayes factors between two models i and j:
BF = [p(μi|x)/p(μj|x)] ÷ [p(μi)/p(μj)] (35)
Choice of summary statistics
The more the merrier?
Figure 20: Choosing summary statistics: the issue of pulling a short blanket.
Choice of summary statistics
1 One could calculate the ratio of posterior density with or without a particular summary statistic. Departures greater than a threshold are suggestive that the excluded summary statistic is important.
2 Different summary statistics can be weighted differently according to their correlation with some model parameters.
3 The number of summary statistics can also be reduced via multivariate dimensional scaling; summary statistics should be scaled in order to have equal mean and variance, if normally distributed.
4 Even if there is no need of a strong theory relating summary statistics to model parameters, it is desirable to have some expectations.
Model validation
Validation is the assessment of the goodness-of-fit of the model and the comparison of alternative models, to distinguish errors due to the approximation from errors caused by the choice of the model.
1 The distributions of simulated summary statistics are visualised and compared to the corresponding target statistic. If the target falls outside the simulated distribution, this could indicate a problem with the model.
2 The observations are compared with the posterior predictive distribution. This can be done by simulating data with parameters drawn randomly from the current posterior distribution.
Applications of ABC in biology
Population genetics, agent-based models, protein interaction networks, speciation rates under a neutral ecological model, extinction rates from phylogenetic data, epidemiology, ...
Figure 21: Patin et al. (2009). Different models simulating the demographic regime of the African groups and the mean proportion of small distances (Ω0.5) obtained in comparisons with simulated statistics.
To ABC or not to ABC?
When a likelihood function is known and can be efficiently evaluated, then there is no advantage in using ABC.
When the likelihood function is known but difficult to evaluate in practice, then ABC is a valid option.
Many scenarios in evolutionary biology or ecology can be generated by simulations.
ABC can be useful for an initial exploratory phase.
Be careful with the choice of your priors!
Intended Learning Outcomes
At the end of this session you are now able to:
critically discuss advantages (and disadvantages) of Bayesian data analysis,
illustrate Bayes’ Theorem and concepts of prior and posterior distributions,
implement simple Bayesian methods in R, including sampling and approximated techniques,
apply Bayesian methods to solve problems in biology.
Please do not redistribute. These slides may contain material protected by copyright.