

Mathematics and Statistics¶
https://bitbucket.org/mfumagal/statistical_inference
Bayesian methods in biology¶
Part 1: Bayesian thinking¶

The eyes and the brain¶

“You know, guys? I have just seen the Loch Ness monster in Hyde Park! Can you believe that?”

What does this information tell you about the existence of Nessie?

In the classic frequentist, or likelihoodist, approach you make some inferences based on all the data that you have observed. The only data that you observe here is me telling you whether or not I saw Nessie. In other words, your inference on whether Nessie exists (at Silwood!) or not will be solely based on such observations.
Let’s denote $D$ (data) as the set of observations specifying whether I tell you that I saw Nessie ($D=1$) or not ($D=0$). $D$ is our sample space, the set of all possible outcomes of the experiment, and $D = \{0,1\}$. We want to make some inferences on the probability that Nessie exists, or that it is true that I saw it (her?). Let’s denote this probability as $N$.
We can define a likelihood function $p(D|N)$. For instance, assuming that our observation is $D=1$, we could set $p(D=1|N=0)=0.1$ (what are the chances that Nessie does not exist, so I did not see it, but I tell you I saw it?) and $p(D=1|N=1)=0.90$ (what are the chances that Nessie exists, I did see it, and I tell you so?).
Let’s make it slightly more complicated and imagine that the second and third lecturers of the day also tell you that they saw Nessie. Let’s assume that $p(D=1|N=0)=0.1$ and $p(D=1|N=1)=0.90$ hold for each observer/lecturer $l$.
Then the log-likelihood of $N=0$ is given by $\sum_{l=1}^3 \log p(D=1|N=0)=-6.91$, while the log-likelihood of $N=1$ is given by $\sum_{l=1}^3 \log p(D=1|N=1)=-0.32$.
TASK What is the Maximum Likelihood Estimate (MLE) of $N$?

In this very trivial example we maximise the likelihood function for $N$ and obtain $N_{MLE}=1$. The difference in log-likelihood between $N=1$ and $N=0$ gives us some sort of confidence level. The more data we have pointing in one “direction”, the more confident we are in our inferences. Recalling our previous example, with 3 observations of $D=1$ we obtained a log-likelihood ratio (log-LR) of $6.59$. With 100 observations of $D=1$ we would obtain a log-LR of $219.72$. On the other hand, with only 1 observation, the log-LR would be $2.20$.
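We can quickly reproduce these numbers in R; a minimal sketch using the values $p(D=1|N=0)=0.1$ and $p(D=1|N=1)=0.9$ assumed above:
In [ ]:
# log-likelihoods and log-likelihood ratios for 1, 3 and 100 observations of D=1
p_D1_N0 <- 0.1
p_D1_N1 <- 0.9
for (l in c(1, 3, 100)) {
    loglikeN0 <- l * log(p_D1_N0)
    loglikeN1 <- l * log(p_D1_N1)
    cat("observations:", l, "logLik(N=0):", loglikeN0, "logLik(N=1):", loglikeN1, "log-LR:", loglikeN1 - loglikeN0, "\n")
}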
We can appreciate how our inference on $N$ is driven solely by our observations, through the likelihood function. In a very informal (and possibly wrong) notation, we can write $p(N|D) \propto p(D|N)$, where we stress that our inference on $N$ is conditional on the observed data $D$.
An analogy here can explain this concept further.

Imagine that in the likelihood approach we use only one sensory organ, i.e. our eyes (or ears).

However, in real life, we take many decisions based not solely on what we observe but also on some beliefs of ours.

We usually use another organ, the brain, to make inferences on the probability of a particular event occurring.

Note that in this cartoon the brain is “blind”, in the sense that it does not observe the data (there is no arrow from the data to the brain), but its inferences on the event are based on its own beliefs.

Back to the Loch Ness monster case: we can clearly hold some belief about whether or not Nessie exists, and not only because I told you I saw it on campus. This “belief” expresses the probability of Nessie existing, $p(N)$, unconditional on the data. Our intuition is that the probability of $N=1$ is somehow a joint product of the likelihood (the eyes) and the belief (the brain). Therefore, $p(N|D) \propto p(D|N)p(N)$.

How can we define $p(N)$? This depends on our blind “belief” function. If you are a Sci-Fi fan you might be inclined to set a higher prior probability (e.g. $p(N=1)=0.2$) than the one a pragmatic and sceptical person would set (e.g. $p(N=1)=0.002$). As an illustration, let’s assume that $p(D=1|N=0)=0.001$ and $p(D=1|N=1)=0.1$. In the “Sci-Fi brain”, $p(N=1|D=1) \propto p(D=1|N=1)p(N=1) = 0.1 \times 0.2 = 2\times10^{-2}$. In the “sceptical brain”, $p(N=1|D=1) \propto 0.1 \times 0.002 = 2\times10^{-4}$. Note that these are not proper probabilities (we will see later how to calculate proper probabilities using “belief” functions). We can see how the choice of a different “belief” function can point us towards different conclusions or confidence levels.
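A minimal sketch of this comparison in R, using the likelihood value and the two priors assumed above:
In [ ]:
# unnormalised posteriors p(N=1|D=1) under the two "belief" functions
likelihood <- 0.1 # p(D=1|N=1)
prior_scifi <- 0.2 # Sci-Fi brain
prior_sceptic <- 0.002 # sceptical brain
likelihood * prior_scifi # 2e-2
likelihood * prior_sceptic # 2e-4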

In statistics, the “belief” function (e.g. $p(N)$) is called prior probability and the joint product of the likelihood ($p(D|N)$) and the prior ($p(N)$) is proportional to the posterior probability ($p(N|D)$).
The use of posterior probabilities for inferences is called Bayesian statistics.

Who, what and why?¶

Who¶
Bayesian statistics is named after Thomas Bayes (1701–1761), an English statistician, philosopher and Presbyterian minister.

Thomas Bayes never published his famous accomplishment in statistics.

His notes were edited and published after his death. He studied logic and theology and, at the age of 33, became a minister in a village in Kent. Only in his later years did Thomas Bayes take a deep interest in probability.
The general interpretation of statistical inference called Bayesian was in reality pioneered by Pierre-Simon Laplace. In fact, some argue that Thomas Bayes intended his results in a much more limited way than modern Bayesians do. In the special case Thomas Bayes presented, the prior and posterior distributions were Beta distributions and the data came from Bernoulli trials. Interestingly, early Bayesian inference was called “inverse probability”, because it infers backwards from observations to parameters.

What¶
Bayesian statistics is an alternative to classical frequentist approaches, where maximum likelihood estimates (MLE) and hypothesis testing based on p-values are often used.
However, there is no definite division between frequentists and Bayesians as, in many modern applications, the approach taken is eclectic.

We now discuss further examples where a Bayesian approach seems the more appropriate strategy to adopt for statistical inference.

Imagine you have just submitted a manuscript for publication to a peer-reviewed journal. You want to assess its probability of being accepted and published.

This assessment may use, for instance, the information regarding the journal’s acceptance rate (say 20%), the quality of the study and its relevance to the journal’s scope.
Your manuscript is accepted and now you want to evaluate the probability that your next manuscript will be accepted. What is the direct estimate of this probability?
You had one success in one trial; therefore, the direct estimate of the probability is 100%. However, it seems clear that this estimate is somehow “wrong”, as it is based on a very small sample size, and we know that the acceptance rate is in any case smaller than 100%.
You can think of the journal’s acceptance rate as our prior information. You are then tempted to set a probability of being accepted that is smaller than 100%. By doing so you are behaving as a Bayesian statistician, as you are adjusting the direct estimate in light of prior information. Bayesian statistics has the ability to incorporate prior information into an analysis.
Suppose you are conducting an experiment measuring the biodiversity of some species on particular rocky shores in Scotland. Specifically, you are recording the number of different species of algae at 4 different locations over 3 years.
| Year | Loc. A | Loc. B | Loc. C | Loc. D |
| ---- | ------ | ------ | ------ | ------ |
| 2015 | 45 | 54 | 47 | 52 |
| 2016 | 41 | n.a. | 43 | 45 |
| 2017 | 32 | 38 | 37 | 35 |
Unfortunately, something happened in 2016 for Location B and no data are reported. What is a reasonable value for that entry, in the absence of direct information? Would you think that 100 could be an answer? Perhaps 100 is too high, since the entries surrounding the missing one point towards a value of around 45. We could fit a model, or simply take an average, to impute the missing data.
Now assume that you have access to some data for Location B in 2016. Specifically, you have partial data: you could retrieve biodiversity levels for only a fifth $(1/5)$ of Location B in 2016. You extrapolate this partial value to obtain an estimate, which turns out to be 100. Are you now willing to impute the missing entry with 100, extrapolated from partial coverage, when before you judged this number to be much higher than expected?
A more intuitive solution would be to take a sort of weighted average between this direct (but uncertain) measurement (100) and the indirect estimate (45, the average of the surrounding cells) you used when no information was available. Finally, imagine that you can retrieve biodiversity values for half $(1/2)$ of Location B in 2016. You would then want to give more weight to this observation than in the previous case, where only a fifth of the area was available. Bayesian statistics formalises such integration of direct and indirect information.
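As a purely illustrative sketch (not a formal Bayesian model), we can compute such coverage-weighted estimates in R; the choice of weighting by the surveyed fraction is an assumption made here for illustration:
In [ ]:
# weighting the direct (but uncertain) extrapolation against the indirect estimate
indirect <- 45 # average of the surrounding cells
direct <- 100 # value extrapolated from partial coverage
for (coverage in c(1/5, 1/2)) {
    estimate <- coverage * direct + (1 - coverage) * indirect
    cat("coverage:", coverage, "weighted estimate:", estimate, "\n")
}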
Let’s recapitulate all inference approaches discussed so far:
• The frequentist approach is based on imagining repeated sampling from a particular model, which defines the probability of the observed data conditional on unknown parameters.
• The likelihoodist approach uses the same sampling model as the frequentist, but all inferences are based on the observed data only.
• The Bayesian approach requires a sampling model (the likelihood) and a prior distribution on all unknown parameters. The prior and the likelihood are used to compute the conditional distribution of the unknown parameters given the observed data.
• The Empirical Bayesian (EB) approach allows the observed data to contribute to defining the prior distribution.
To put it in a different perspective, assuming $D$ is the data and $\theta$ is your unknown parameter, the frequentist approach conditions on parameters and integrates over the data, $p(D|\theta)$. On the other hand, the Bayesian approach conditions on the data and integrates over the parameters, $p(\theta|D)$.
Therefore, in Bayesian statistics we derive proper probability distributions for our parameters of interest, rather than just point estimates. In other words, in Bayesian statistics a probability is assigned to a hypothesis, while under frequentist inference a hypothesis is tested without being assigned a probability of being true. Unlike likelihoodists, Bayesians can “accept” the null hypothesis rather than merely “fail to reject” it. Bayesian procedures can also impose parsimony in model choice and avoid further testing for multiple comparisons.
Why¶
The recent explosion of interest in Bayesian methods for data analysis is mainly because:
• of the increased computing power available in recent years;
• they have good frequentist properties;
• their answers are more easily interpretable by non-specialists;
• they are already implemented in software packages.
We will be able to appreciate these points later on, after discussing some features and properties of Bayesian statistics.
Bayesian statistics is used in many topics in life sciences, such as genetics (e.g. fine-mapping of disease-susceptibility genes), ecology (e.g. agent-based models), evolution (e.g. inference of phylogenetic trees), bioinformatics (e.g. base calling).

Bayesian concepts¶
We will now formalise Bayesian equations and provide case studies where such an approach has been successfully used in the life sciences.
Bayes’ theorem¶
If $Y$ is a random variable, then $f(\vec{y}|\vec{\theta})$ is a probability distribution representing the sampling model for the observed data $\vec{y}=(y_1,y_2,…,y_n)$ given a vector of unknown parameters $\vec{\theta}$.
The distribution $f(\vec{y}|\vec{\theta})$ is often called the likelihood and sometimes written as $L(\vec{\theta};\vec{y})$.
For the sake of simplicity, from now on we write $y$ and $\theta$ in place of $\vec{y}$ and $\vec{\theta}$, keeping in mind that both may be vectors.
We know that $L(\theta;y)$ is not a probability distribution for $\theta$ given $y$; therefore $\int L(\theta;y)d\theta$ is not necessarily equal to $1$, or even finite. Nevertheless, it is possible to find the value of $\theta$ that maximises the likelihood function. In other words, we can calculate a maximum likelihood estimate (MLE) for $\theta$, as \begin{equation} \hat{\theta}=\text{argmax}_{\theta}L(\theta;y) \end{equation}
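As a quick numerical illustration (a sketch assuming a single observation $y=6$ from a Normal likelihood with $\sigma=1$, the same values used in a later example), the MLE can be located with a one-dimensional optimiser in R:
In [ ]:
# numerical maximisation of the log-likelihood; the MLE of theta is y itself
y <- 6
loglike <- function(theta) dnorm(x=y, mean=theta, sd=1, log=TRUE)
optimize(f=loglike, interval=c(-10, 20), maximum=TRUE)$maximum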
In Bayesian statistics, $\theta$ is not a fixed (although unknown) parameter but a random quantity. This is done by adopting a probability distribution, called prior distribution, for ${\theta}$ that contains any information we have about ${\theta}$ not related to the data ${y}$.
Under this approach, inference on ${\theta}$ is then based on its posterior distribution, given by \begin{equation} P({\theta}|{y}) = \frac{f({y}|{\theta})\pi({\theta})}{m({y})} = \frac{f({y}|{\theta})\pi({\theta})}{\int f({y}|{\theta}) \pi({\theta}) d{\theta}} \end{equation} This formula is known as Bayes’ theorem. The posterior is simply the product of the likelihood and the prior, normalised so that it integrates to $1$. The posterior distribution is therefore a proper (or legitimate) probability distribution.
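Before proving the theorem, here is a minimal numerical sketch of this normalisation on a grid; the Normal likelihood and prior (and their parameter values) are arbitrary choices for illustration:
In [ ]:
# Bayes' theorem on a grid: the denominator m(y) is approximated by numerical integration
theta <- seq(-5, 15, 0.01)
y <- 6
likelihood <- dnorm(x=y, mean=theta, sd=1)
prior <- dnorm(x=theta, mean=2, sd=1)
posterior <- likelihood * prior / sum(likelihood * prior * 0.01)
# the posterior is a proper density: it integrates to (approximately) 1
sum(posterior * 0.01)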
Let’s prove this theorem using a concrete example. We can also visualise probabilities using Venn diagrams (a useful, if not very rigorous, approach). Let’s talk about frogs, and saving them. The greatest loss of vertebrate biodiversity observed in the past 30 years is due to a chytrid fungus, which is responsible for the extinction of over a hundred species of amphibians.

Let’s assume we have a sample space $S$ with all the possible outcomes of an experiment and we are interested in a subset of $S$, representing only some events. In our example we are interested in detecting which samples of frogs are infected or not by the fungus. For doing so, during our fieldwork, we take some samples and test whether they are infected or not.
Let’s consider $S$ to consist of all the samples collected in a particular area. We can split our $S$ in two events: the event “samples with infection” (designated as set $A$), and “samples with no infection” (complement of set $A$, or $A^c$).
What is the probability that a randomly chosen sample is infected?
It is the number of elements in $A$ divided by the number of elements of $S$. We can denote the number of elements of $A$ as $|A|$, called the cardinality of $A$. We can define the probability of $A$, $P(A)$, as \begin{equation} P(A) = \frac{|A|}{|S|} \end{equation} with $0 \leq P(A) \leq 1$.
Assume that we use a molecular screening test which takes a biological sample (e.g. a piece of skin) from a frog and tests for the presence of the fungus. Obviously, the test will be “positive” for some samples, and “negative” for others. Let’s denote by $B$ the event comprising the “samples for which the test is positive”. What is the probability that the test will be “positive” for a randomly selected sample? It will be \begin{equation} P(B) = \frac{|B|}{|S|} \end{equation} with $0 \leq P(B) \leq 1$.
We are now dealing with the entire sample space $S$ (all samples), the event $A$ (samples with infection), and the event $B$ (samples with a positive test). So far we have treated the two events separately, in isolation. What happens if we put them together? We can calculate the probability of both events occurring ($A \cap B$) simultaneously. \begin{equation} P(A \cap B) = \frac{|A \cap B|}{|S|} \end{equation} with $0 \leq P(A \cap B) \leq 1$. Note that sometimes $AB$ is used as shorthand notation for $A \cap B$. The event $AB$ represents “samples with infection and with a positive test”. Note that there is also the event $(B – AB)$ or “samples without infection and with a positive test”, and the event $(A – AB)$ or “samples with infection and with a negative test”.
The question we want to answer now is “given that the test is positive for a randomly selected sample, what is the probability that said sample is infected?”.
In terms of a Venn diagram, this question translates to “given that we are in region $B$, what is the probability that we are in region $A \cap B$?”. This is equivalent to saying “if we make region $B$ our new universe, what is the probability of $A$?”. The notation for the latter conditional probability is $P(A|B)$, read “the probability of $A$ given $B$”. This probability is equal to \begin{equation} P(A|B) = \frac{|A \cap B|}{|B|} = \frac{|A \cap B|/|S|}{|B|/|S|} = \frac{P(A \cap B)}{P(B)} \end{equation}
If we ask the opposite question, “given that a randomly selected sample is infected (event $A$), what is the probability that the test is positive for that sample (event $B$)?”, the conditional probability is \begin{equation} P(B|A) = \frac{P(B \cap A)}{P(A)} \end{equation}
If we put together these last two equations (through the common term $P(B \cap A) = P(A \cap B)$) we obtain: \begin{equation} P(A)P(B|A) = P(B)P(A|B) \end{equation} It follows that \begin{equation} P(A|B) = \frac{P(A)P(B|A)}{P(B)} \end{equation} which is Bayes’ theorem.
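We can verify this theorem numerically in R with some made-up counts (the numbers below are purely illustrative):
In [ ]:
# toy counts: 200 samples, 40 infected (A), 50 positive tests (B), 35 both (A and B)
S <- 200; A <- 40; B <- 50; AB <- 35
pA <- A / S; pB <- B / S; pAB <- AB / S
pA_given_B <- pAB / pB
pB_given_A <- pAB / pA
# Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B); both quantities below are equal
c(pA_given_B, pA * pB_given_A / pB)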
A Normal/Normal model¶
We will now illustrate a simple Bayesian model to appreciate the interplay between likelihood and prior. Let’s use again the example of infected frogs and imagine we want to assess the rate of infection in a particular geographical area. Let’s assume that we monitored $20$ frogs in a given pond. We now want to make some inferences on the (total) number of infected frogs, $\theta$, in the whole area.
We consider the case where both the prior and the likelihood are Normal (Gaussian) distributions, that is \begin{align} f(y|\theta) &= N(y|\theta, \sigma^2) \\ \pi(\theta) &= N(\theta|\mu, \tau^2) \end{align} Parameters $\mu$ and $\tau$ are known hyperparameters, while $\theta$ is the unknown parameter.
As we will learn more formally later on, the posterior distribution $p(\theta|y)$ is also a Normal distribution, with parameters \begin{equation} p(\theta|y) = N\left(\theta \,\middle|\, \frac{\sigma^2\mu+\tau^2y}{\sigma^2+\tau^2}, \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right) \end{equation} If we define $B$ as \begin{equation} B = \frac{\sigma^2}{\sigma^2+\tau^2} \end{equation} then we can write \begin{align} E(\theta|y) &= B\mu + (1-B)y \\ Var(\theta|y) &= (1-B)\sigma^2 \equiv B\tau^2 \end{align} $B$ is called the shrinking factor because it gives the proportion by which the posterior mean is “shrunk back” from the classical frequentist estimate $y$ towards the prior mean $\mu$. Note that $0 \leq B \leq 1$.
The posterior mean is a weighted average of the prior mean $\mu$ and the direct estimate $y$. The weight on the prior mean, $B$, depends on the relative variability of the prior distribution and the likelihood. If $\sigma^2 \gg \tau^2$, then $B \approx 1$ and our prior knowledge is more precise than the data information. On the other hand, if $\sigma^2 \ll \tau^2$, then $B \approx 0$, our prior knowledge is imprecise, and the final estimate will move very little towards the prior mean.
We can appreciate this trade-off between data information and prior distribution with the following example. Assume that we have a single observation of $6$ infected frogs from one pond. In this case $y=6$, and we assume that our likelihood function (Normal) has $\sigma=1$. Furthermore, let's assume that we expected $2$ infected frogs before observing the data. Therefore our prior distribution (Normal) has, for instance, $\mu=2$ and $\tau=1$. More formally, we can write \begin{align} f(y=6|\theta) &= N(y=6|\theta, 1) \\ \pi(\theta) &= N(\theta|2, 1) \end{align} Let's use R to understand how this Normal/Normal model works in practice. Let's calculate and plot the prior, likelihood and posterior distributions.
In [ ]:
# prior
mu <- 2
tau <- 1
x <- seq(-4,10,0.01)
plot(x=x, dnorm(x=x, mean=mu, sd=tau), ylim=c(0,0.6), type="l", lty=1, ylab="Density", xlab=expression(theta), main="")
legend(x="topleft", legend=c(expression(pi(theta)), expression(f(y~"|"~theta)), expression(p(theta~"|"~y))), lty=1:3)
In [ ]:
# prior
plot(x=x, dnorm(x=x, mean=mu, sd=tau), ylim=c(0,0.6), type="l", lty=1, ylab="Density", xlab=expression(theta), main="")
legend(x="topleft", legend=c(expression(pi(theta)), expression(f(y~"|"~theta)), expression(p(theta~"|"~y))), lty=1:3)
# likelihood
y <- 6
sigma <- 1
points(x=x, y=dnorm(x=y, mean=x, sd=sigma), type="l", lty=2)
In [ ]:
# prior
plot(x=x, dnorm(x=x, mean=mu, sd=tau), ylim=c(0,0.6), type="l", lty=1, ylab="Density", xlab=expression(theta), main="")
legend(x="topleft", legend=c(expression(pi(theta)), expression(f(y~"|"~theta)), expression(p(theta~"|"~y))), lty=1:3)
# likelihood
points(x=x, y=dnorm(x=y, mean=x, sd=sigma), type="l", lty=2)
# posterior
B <- sigma^2/(sigma^2+tau^2)
postMean <- B*mu + (1-B)*y
postVar <- B*tau^2
points(x=x, y=dnorm(x=x, mean=postMean, sd=sqrt(postVar)), type="l", lty=3)
The prior distribution is centred around $2$ ($\mu$), as expected. How about the likelihood function? It is centred around $6$, which is the only observation we have. The posterior distribution is centred exactly between the prior and the likelihood. Indeed, in this case $B=0.5$ and therefore prior and data are equally weighted. The maximum a posteriori probability (MAP) estimate is $4$, as it is equal to the mode of the posterior distribution. You can easily retrieve the MAP for this example in R.
In [ ]:
x[which.max(dnorm(x=x, mean=postMean, sd=sqrt(postVar)))]
You may also observe that the posterior distribution is narrower (more peaked) than both the prior and the likelihood, despite these two distributions having the same variance. The posterior variance is smaller than the variance of either the prior or the likelihood. The precision of the posterior, defined as the reciprocal of the variance, is the sum of the precisions of the prior and the likelihood. The combined strength of prior and likelihood tends to increase the precision, or reduce the variance, of our inference on $\theta$. In this example, the precision is $1+1=2$, hence the variance is $1/2$. Therefore the posterior roughly covers $4 \pm 3\sqrt{1/2} \approx (1.88, 6.12)$.
What happens if we use a sharper (narrower) or a wider prior? In other words, what is the shape of the posterior distribution if the variance of the prior is smaller (stronger belief) or larger (weaker belief)?
TASK Recalling the previous example, assume that the prior distribution has $\mu=2$ and $\tau=0.5$ (before we had $\tau=1$). Calculate and plot the prior and posterior distributions and evaluate the MAP in this case using R. Is the posterior mean closer to or further from the prior mean? What is the shrinking factor?
In [ ]:
# sharper prior
mu <- 2
tau <- ...
x <- seq(-4,10,0.01)
plot(x=x, dnorm(x=x, mean=mu, sd=tau), ylim=c(0,1), type="l", lty=1, ylab="Density", xlab=expression(theta), main="")
legend(x="topleft", legend=c(expression(pi(theta)), expression(f(y~"|"~theta)), expression(p(theta~"|"~y))), lty=1:3)
# likelihood
...
# posterior
...
# MAP and range
...
TASK Recalling the previous example, assume that the prior distribution has $\mu=2$ and $\tau=2$ (before we had $\tau \leq 1$). Calculate and plot the prior and posterior distributions and evaluate the MAP in this case using R. Is the posterior mean closer to or further from the prior mean? What is the shrinking factor?
In [ ]:
# wider prior
mu <- 2
tau <- ...
x <- seq(-4,10,0.01)
plot(x=x, dnorm(x=x, mean=mu, sd=tau), ylim=c(0,1), type="l", lty=1, ylab="Density", xlab=expression(theta), main="")
legend(x="topleft", legend=c(expression(pi(theta)), expression(f(y~"|"~theta)), expression(p(theta~"|"~y))), lty=1:3)
# likelihood
#...
# posterior
#...
# MAP and range
#...
Let's assume now that we have more observations of infected frogs in multiple ponds. For instance, let's imagine that $\vec{y}=\{6,5,5.5,6.5,6\}$. Given a sample of $n$ independent observations, then \begin{equation} f(\vec{y}|\vec{\theta}) = \prod_{i=1}^n f(y_i|\vec{\theta}) \end{equation} However, we can also use a transformation if we can find a statistic $S(\vec{y})$ that is sufficient, meaning that $p(\vec{\theta}|\vec{y})=p(\vec{\theta}|S(\vec{y}))$. In this example, we can use the sufficient statistic $S(\vec{y})=\bar{y}$, where $\bar{y}$ is the mean of $\vec{y}$. The likelihood function has the form $f(\bar{y}|\theta)=N(\theta, \sigma^2/n)$ and the posterior distribution is \begin{equation} p(\theta|\bar{y})=N\left(\theta \,\middle|\, \frac{(\sigma^2/n)\mu + \tau^2\bar{y}}{(\sigma^2/n) + \tau^2}, \frac{(\sigma^2/n)\tau^2}{(\sigma^2/n) + \tau^2} \right) \end{equation} Suppose we keep $\mu=2$, $\sigma=\tau=1$ and set $\bar{y}=5.8$ and $n=5$. Let's have a look at the resulting posterior distribution.
In [ ]:
# prior (obviously it does not change)
mu <- 2
tau <- 1
x <- seq(-4,10,0.01)
plot(x=x, dnorm(x=x, mean=mu, sd=tau), ylim=c(0,2), type="l", lty=1, ylab="Density", xlab=expression(theta), main="")
legend(x="topleft", legend=c(expression(pi(theta)), expression(f(y~"|"~theta)), expression(p(theta~"|"~y))), lty=1:3)
# likelihood with more observations; the standard deviation of ybar is sigma/sqrt(n)
y <- c(6, 5, 5.5, 6.5, 6)
n <- length(y)
sigma <- 1
points(x=x, y=dnorm(x=x, mean=mean(y), sd=sigma/sqrt(n)), type="l", lty=2)
# posterior with more observations
postMean <- ( (sigma^2/n)*mu + tau^2*mean(y) ) / ( (sigma^2/n) + tau^2 )
postVar <- ( (sigma^2/n)*tau^2 ) / ( (sigma^2/n) + tau^2 )
points(x=x, y=dnorm(x=x, mean=postMean, sd=sqrt(postVar)), type="l", lty=3)
# MAP with more observations
map <- x[which.max(dnorm(x=x, mean=postMean, sd=sqrt(postVar)))]
cat("MAP:", map, "(", map-3*sqrt(postVar),",",map+3*sqrt(postVar),")\n")
You can easily find that $\theta_{MAP}=5.17$, with a range of roughly $(3.94, 6.39)$.
The MAP has been shifted towards the MLE, as we have more data information.
Monte Carlo sampling¶
To derive the posterior distribution we can also draw random samples from it, instead of directly calculating the posterior mean and variance (when, for instance, a Normal distribution is considered). This procedure is often called Monte Carlo sampling, after the city famous for its casinos.
In the previous example of the Normal/Normal model with multiple observations, we were able to calculate the posterior mean ($5.17$) and posterior variance ($0.17$). From these parameters, we were able to derive (and plot) the density function, the posterior probability itself. Alternatively, we can sample directly from the posterior distribution. Let's see how we can do this in R.
In [ ]:
## Monte Carlo sampling
par(mfrow=c(3,1))
# posterior
x <- seq(2,8,0.01)
postMean <- 5.17
postVar <- 0.17
plot(x=x, y=dnorm(x=x, mean=postMean, sd=sqrt(postVar)), type="l", lty=1, ylab="Density", xlab=expression(theta), main=expression(p(theta~"|"~y)), ylim=c(0,1.2), xlim=c(2,8))
# sampling
y_sampled_1 <- rnorm(n=50, mean=postMean, sd=sqrt(postVar))
hist(y_sampled_1, breaks=20, freq=F, lty=2, col="grey", ylim=c(0,1.2), xlim=c(2,8), sub="50 samples", main=expression(p(theta~"|"~y)), xlab=expression(theta))
points(x=x, y=dnorm(x=x, mean=postMean, sd=sqrt(postVar)), type="l", lty=1)
# more sampling
y_sampled_2 <- rnorm(n=1e6, mean=postMean, sd=sqrt(postVar))
hist(y_sampled_2, breaks=20, freq=F, lty=2, col="grey", ylim=c(0,1.2), xlim=c(2,8), sub="1e6 samples", main=expression(p(theta~"|"~y)), xlab=expression(theta))
points(x=x, y=dnorm(x=x, mean=postMean, sd=sqrt(postVar)), type="l", lty=1)
The more samples we draw, the closer our sampled distribution will be to the "true" posterior distribution. With only 50 samples the empirical posterior mean and variance will typically deviate noticeably from the true values, while with $10^6$ samples they will be very close to our direct estimates of $5.17$ and $0.17$ (the exact values will vary from run to run).
You can easily see how, in this simple Normal/Normal case, Monte Carlo methods are not strictly necessary, since the integral in the denominator of Bayes' theorem can be evaluated in closed form (as we will see later on). In such cases it is preferable to derive a smooth curve, rather than a histogram of sampled values, and obtain the corresponding exact values for the posterior parameters. However, there are cases where, given the choice of likelihood and prior functions, this integral cannot be evaluated. In those cases, Monte Carlo methods are to be preferred for estimating, or rather approximating, the posterior distribution. As samples can be drawn from any posterior regardless of how many parameters $\theta$ comprises, we have the ability to work on problems of (theoretically) unlimited complexity, at the price of not obtaining an exact form for the posterior and of performing a large number of samplings.
Practical: reconstructing genomes from sequencing data¶
We want to use a Bayesian approach to reconstruct genomes from data produced by high-throughput sequencing machines. First, load the R functions needed with source("functions.R").
Among these functions, we provide one that calculates the likelihood of a certain sequence of bases for diploid individuals. This function is called calcGenoLikes and takes 5 parameters as input:
• the sequence itself (collection of bases)
• the first allele of the genotype
• the second allele of the genotype
• the sequencing error rate
• a boolean indicating whether the results should be returned on a logarithmic scale (TRUE) or not (FALSE)
For instance, assuming that your sequence is AAGAGGA, your alleles are A and G (meaning that you want to calculate the likelihood for genotypes {AA,AG,GG}), and your sequencing error rate is 0.05, then the likelihood (not in logarithms) for each genotype is given by calcGenoLikes("AAGAGGA", "A", "G", 0.05, FALSE)
Complete all the following tasks using R when necessary. The key point of these exercises is to not recalculate quantities that you have already computed. The aim is that you should be able to understand whether the likelihood or the prior is the same (or not) between different scenarios.
A) Using Bayes' theorem, write the formula for the posterior probability of genotype G being AA given the sequencing data D. Write the explicit denominator assuming that your alleles are A and G and all possible genotypes are only AA, AG, GG.
...
B) Assuming that your data is AAAG, your alleles are A and G, and the sequencing error rate is 0.01, calculate the genotype posterior probabilities using a uniform prior, e.g. $P(G=AA) = P(G=AG) = P(G=GG) = ?$.
In [ ]:
...
C) With the same assumptions as in point B, calculate the genotype posterior probabilities using prior probabilities based on Hardy-Weinberg equilibrium with a frequency of G of 0.1. Do you need to calculate a new likelihood or is it the same one as in point B?
In [ ]:
...
D) With the same assumptions as in point C, calculate the genotype posterior probabilities using a prior based on Hardy-Weinberg equilibrium with a frequency of G of 0.1 and an inbreeding coefficient of 0.2. In this case, we need to modify our previous priors. Specifically, if $f$ is the frequency of allele A and $I$ is the inbreeding coefficient, then the prior probabilities for all genotypes are:
• $p(AA)=f^2 + I \times f \times (1-f)$
• $p(AG)=2 \times f \times (1-f) \times (1-I)$
• $p(GG)=(1-f)^2 + I \times f \times (1-f)$
Do you need to calculate a new likelihood or is it the same one as in points B and C?
...
E) With the same priors used in point D but with a sequencing error rate of 0.05, calculate the genotype posterior probabilities. Do you need to calculate a new likelihood or is it the same one as in point D?
In [ ]:
...
F) Plot all previous results (e.g. use a barplot with the 3 posterior probabilities for each scenario B-E).
In [ ]:
...
G) Assuming that our collection of sequenced bases is AAAGAGAAAAAAAGGGGGAAAGGA, calculate the genotype posterior probabilities using the same priors as in point C. What happens when we have more data? What is the confidence in our genotype inference?
In [ ]:
...
H) What happens if we have a lot of data? Assume that your sequenced bases are bases <- paste(c(rep("A",1e3),rep("G",1e3)), sep="", collapse=""). What happens here? It is convenient to work with numbers on a log scale: $\log(a \times b / c) = \log(a) + \log(b) - \log(c)$. You can do that by setting TRUE as the last parameter of calcGenoLikes. What is the effect of the prior here? Remember that if you want to calculate proper probabilities (in log space) you have to compute the log of a sum of terms that are themselves stored as logs.
In [ ]:
...
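For task H, a helper along the following lines may be useful. This is a hedged sketch (it is not part of functions.R) of the standard log-sum-exp trick for summing probabilities that are stored as logarithms without numerical underflow:
In [ ]:
# log(sum(exp(x))) computed stably as m + log(sum(exp(x - m))), with m = max(x)
logSumExp <- function(x) {
    m <- max(x)
    m + log(sum(exp(x - m)))
}
# e.g. normalising three genotype log-likelihoods (hypothetical values) into
# log posterior probabilities under a uniform prior
loglikes <- c(-2000, -2005, -2030)
loglikes - logSumExp(loglikes)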
Prior distributions¶
One of the main features of Bayesian statistics is that we assign probability distributions not only to data variables ${y}$ but also to parameters ${\theta}$. In other words, we quantify whatever feelings or beliefs we have about ${\theta}$ before observing ${y}$. Using Bayes' theorem, we obtain a posterior distribution of ${\theta}$, a blend of the information from the data and from the prior.
How can we decide which prior distribution is most appropriate for our study? As you can imagine, this can be an arduous task, and it hampered the use of Bayesian statistics in the past. In general, prior distributions are derived from past information or from the personal opinions of experts. It is typical to restrict prior probabilities to familiar distributional families. Alternatively, one can require the prior distribution to carry little information. We now review some possibilities for assigning prior distributions.
Elicited priors¶
The simplest approach to specifying $\pi(\theta)$ is to define the collection of values of $\theta$ which are possible. Then one can assign some probability to each one of these cases and make sure that they sum to $1$. If $\theta$ is discrete, this looks like a natural approach.
Let's understand this use of elicited priors with an example. Imagine that your prior distribution describes the number of kits a mother rabbit will have in the next litter. Perhaps you want to make some inference on the biological mechanisms determining the number of kits. In this case you may have a likelihood function relating some observations $\vec{y}$ (e.g. genetic or environmental markers) to the number of kits $\theta$. $\theta$ is clearly discrete and you may have some past information on its distribution. For instance, from the literature, we know that "Rabbits can have anywhere from one to 14 babies, also called kits, in one litter. An average litter size is 6. Hereditary and environmental factors play a role in the number of kits born in a litter."
Therefore, as a first try, you can assign \begin{equation} \pi(\theta=0)=\pi(\theta>14)=0 \end{equation} Secondly, you can impose that it is more probable that this mother will have $6$ kits, as this is the average litter size based on past information: \begin{equation} \pi(\theta=2)<\pi(\theta=6)>\pi(\theta=10) \end{equation} Finally, you must ensure that \begin{equation} \sum_{i=1}^{14} \pi(\theta=i) = 1 \end{equation}
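As a sketch in R, one (arbitrary) way of satisfying all three constraints is a triangular prior peaking at the average litter size, normalised so that the masses sum to 1:
In [ ]:
# elicited discrete prior on the number of kits (1 to 14)
theta <- 1:14
weights <- pmax(7 - abs(theta - 6), 0.5) # peaks at theta=6, small positive mass elsewhere
prior <- weights / sum(weights) # normalise so that the prior sums to 1
sum(prior) # equals 1
barplot(prior, names.arg=theta, xlab=expression(theta), ylab=expression(pi(theta)))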
On the other hand, if $\theta$ is continuous, a simple solution is to discretise our prior distribution by assigning masses to intervals. In other words, you create a histogram prior for $\theta$. Imagine that your prior distribution specifies the recorded temperature in hot springs at Lassen Volcanic National Park. Specifically, you are interested in relating the temperature of different pools at Bumpass Hell to the occurrence of certain extremophile micro-organisms, capable of surviving in extremely hot environments.

For instance, you may want to predict the temperature of the pool from the observations of such micro-organisms. The latter distribution is the likelihood function. However, you also want to assign a prior distribution to the pool temperature, $\theta$. Clearly $\theta$ is continuous and, from past observations, we know that it has a range of $(80,110)$ with an average of $88$, in degrees Celsius. A simple solution would be to derive a histogram prior for $\theta$, such as \begin{equation} \pi(80 \leq \theta < 85) < \pi(85 \leq \theta < 90) > \pi(90 \leq \theta < 95) \end{equation} Again, you have to make sure that all these probabilities sum to $1$. Furthermore, it is important that the histogram is sufficiently wide, as the posterior will have support only for values that are included in the prior. Alternatively, we may assume that the prior distribution for $\theta$ belongs to a parametric distributional family $\pi(\theta|\nu)$. Here we choose $\nu$ so that $\pi(\theta|\nu)$ closely matches our elicited beliefs. This approach has several advantages, including:
• it reduces the effort for the elicitee (you don't have to decide a probability for each value $\theta$ can take);
• it overcomes the finite support problem (as in the case of the histogram);
• it may lead to simplifications in the computation of the posterior (as we will see later on).
A limitation of this approach is that it may be impossible to find a distribution that perfectly matches the elicitee's beliefs. For instance, the prior for temperatures could be Normally distributed as $N(\mu,\sigma^2)$ bounded at $(80,110)$, that is \begin{align} \pi(\theta) &= 0 & \text{for } \theta < 80 \text{ or } \theta > 110 \\ \pi(\theta) &= N(\mu,\sigma^2) & \text{for } 80 \leq \theta \leq 110 \end{align}
with $\mu=88$ and $\sigma^2=10$. We can plot this distribution in R.
In [ ]:
## elicited prior
mu <- 88
sigma2 <- 10
x <- seq(80,110,0.1)
plot(x=x, dnorm(x=x, mean=mu, sd=sqrt(sigma2)), type="l", lty=1, ylab="Density", xlab=expression(theta), main=expression(pi(theta)))
Note that this distribution is not defined outside the interval $(80,110)$. Consequently, the posterior distribution will have no mass outside this interval. Overconfidence may result in failing to condition on events outside the range of personal experience or previous observations. For instance, the fact that a temperature lower than $80$ has never been observed may be better represented by setting a small (but greater than 0) probability of it occurring.
As a rule of thumb, for elicited priors it is recommended to focus on quantiles close to the middle of the distribution (e.g. the $50^{th}$, $25^{th}$ and $75^{th}$) rather than on extreme quantiles (e.g. the $95^{th}$ and $5^{th}$). You should also assess the symmetry of your prior. Elicited priors can be updated and reassessed as new information becomes available. They are very useful for experimental design, where some idea of the nature of the studied system is given as input.
Conjugate priors¶
When choosing a prior distribution $\pi(\theta|\nu)$, some distributional families will make the calculation of the posterior distribution more convenient than others. It is possible to select a member of a family that is conjugate to the likelihood $f(y|\theta)$, meaning that the posterior distribution $p(\theta|y)$ belongs to the same distributional family as the prior.
Let's illustrate this point with an example. Suppose we are interested in modelling the arrival of herds of elephants at a specific water pond in the savannah on a given day during the migratory season. We may be interested in this estimate for tracking migratory routes or assessing population sizes. $Y$ is the count of distinct elephant herds or groups (not the number of elephants itself!) arriving at the pool in a day during the migration season (not during the whole year!).
A Poisson distribution has a natural interpretation for modelling arrival rates of discrete events. Indeed, the Poisson distribution is a discrete probability distribution that gives the probability of a given number of events occurring in a fixed interval of time (or space) when such events occur independently and with a known average rate. The Poisson distribution is an appropriate model under the following assumptions:
1. $Y$ is the number of times an event occurs in an interval and it can take any non-negative integer value, including 0;
2. the occurrence of one event does not affect the probability that a second event will occur (i.e. events occur independently);
3. the rate at which events occur is constant (it cannot be higher in some intervals and lower in others);
4. two events cannot occur at exactly the same instant;
5. the probability of an event in an interval is proportional to the length of the interval.
Condition 1 is clearly met in our case. Conditions 2 and 4 assume that different herds do not follow each other (perhaps by taking different routes). For the sake of illustrating this distribution, we will assume this to be true. You can see that if $Y$ were the number of elephants (not herds) then condition 2 would not be met, as elephants tend to migrate in groups. Condition 3 is met when we focus our analysis on the annual period when we expect to see herds, not on the whole year. Condition 5 is easily met, as the number of herds arriving in a week is likely to be higher than the number arriving in a day.
If we assume that all these conditions hold, then $Y$, the number of elephant herds arriving at a pool, is a Poisson random variable. The event under study ($y$, the number of herds) can occur $0, 1, 2, ...$ times in the interval (a day). The average number of events (herds arriving) in an interval (one day) is designated as our parameter $\theta$. Note that the parameter of the Poisson distribution is typically written as $\lambda$. In our example, $\theta$ is the event rate, or rate parameter. As such, the probability of observing $y$ events in an interval is given by \begin{equation} f(y|\theta) = \frac{e^{-\theta}\theta^y}{y!} \text{, } y \in \{0, 1, 2, ...\} \text{, } \theta>0 \end{equation}
This is our likelihood distribution and, once we know $\theta$, then the whole distribution is defined. As you can see, $\theta$ has to be positive (not necessarily an integer) and $y$ is a positive integer. Note that the equation above is a probability mass function (pmf), as it is defined only for discrete values of $y$.
For instance, assuming that the rate $\theta$ is set to $4$ (4 herds per day during the migration season), then: \begin{align} P(y=0) &= \frac{e^{-4}4^0}{0!} = e^{-4} = 0.0183 \\ P(y=1) &= \frac{e^{-4}4^1}{1!} = \ldots = 0.0733 \\ P(y=2) &= \frac{e^{-4}4^2}{2!} = \ldots = 0.147 \end{align} We can plot this likelihood distribution for $\theta=4$ using R.
In [ ]:
## Poisson distribution
theta <- 4
y <- seq(0, 20, 1)
plot(x=y, dpois(x=y, lambda=theta), type="p", lty=1, xlab=expression(y), main=expression(theta~"="~4), ylab=expression(p(Y~"="~y~"|"~theta)))
As you can see, the highest mass is around 4, and above 12 the probability is very close to 0. You may recall that a Poisson distribution has expected value and variance equal to the rate parameter. Note that we have some non-zero probability of observing 0 events.
If we want to perform a Bayesian analysis, we need to define a prior distribution for $\theta$ with support on positive values. In other words, we assume that we don't know the exact value of $\theta$, but we can make some assumptions about its probability distribution. A reasonable choice (as we will appreciate later) is the Gamma distribution \begin{equation} \pi(\theta) = \frac{\theta^{\alpha-1}e^{-\theta/\beta}}{\Gamma(\alpha)\beta^\alpha} \text{, } \theta>0 \text{, } \alpha>0 \text{, } \beta>0 \end{equation}
The Gamma distribution will be our prior distribution. We can denote this as $\theta \sim G(\alpha,\beta)$. As you can see, the Gamma distribution is a two-parameter family of continuous probability distributions. Note that the common exponential and chi-squared distributions are special cases of the Gamma distribution. We have also suppressed the dependency of $\pi$ on the hyperparameters $\nu=(\alpha,\beta)$, since we assume them to be known. $\alpha$ is known as the shape parameter, while $\beta$ (in the parameterisation above) is the scale parameter. The expected value and variance of the Gamma distribution are \begin{align} E(G(\alpha,\beta)) &= \alpha\beta \\ Var(G(\alpha,\beta)) &= \alpha\beta^2 \end{align} Let’s have a look at the shape of the Gamma distribution for different values of its parameters using R.
In [ ]:
## Gamma distribution
alpha <- c(0.5,2,10)
beta <- c(0.5,2)
thetas <- seq(0, 20, 0.1)
mycolors <- topo.colors(6)
plot(x=thetas, dgamma(x=thetas, shape=alpha[1], scale=beta[1]), type="l", lty=1, xlab=expression(theta), main="Gamma", ylab=expression(pi(theta)), ylim=c(0,1.0), col=mycolors[1], lwd=2)
index <- 0
for (i in alpha) {
    for (j in beta) {
        index <- index+1
        points(x=thetas, dgamma(x=thetas, shape=i, scale=j), col=mycolors[index], type="l", lwd=2)
    }
}
names <- cbind(rep(alpha,each=2), rep(beta))
legend(x="topright", legend=apply(FUN=paste, MAR=1, X=names, sep="", collapse=", "), col=mycolors, lty=1, lwd=2)
As you can see from this plot, the Gamma distribution is very flexible: it can have one tail ($\alpha \leq 1$) or two tails ($\alpha > 1$). For very large values of $\alpha$ the Gamma distribution resembles a Normal distribution. The $\beta$ parameter shrinks or stretches the distribution relative to 0 but it doesn’t change its shape.
Using Bayes’ theorem, we can now obtain the posterior probability \begin{align} p(\theta|y) &\propto f(y|\theta)\pi(\theta) \\ &\propto (e^{-\theta}\theta^y)(\theta^{\alpha-1}e^{-\theta/\beta}) \\ &= \theta^{y+\alpha-1}e^{-\theta(1+1/\beta)} \end{align} Since we are only interested in the posterior up to a normalising constant, we drop all factors that do not depend on $\theta$. You may recognise this posterior distribution as a Gamma distribution $G(\alpha’,\beta’)$ with $\alpha’=y+\alpha$ and $\beta’=(1+1/\beta)^{-1}$. We were able to do this because the Gamma distribution (our prior) is the conjugate family for the Poisson distribution (our likelihood).
To finalise our example of elephant herds, suppose that, before looking at the actual data, we have some intuition that we expect to see 3 herds per day. This could be derived from observations in previous years. Let’s also assume that we are not very confident and we assign a large variance to it. For instance, our prior distribution is $G(3,1)$, which gives an expected value and variance of 3. We then observe 4 herds ($y=4$). Therefore, our posterior distribution will be a Gamma distribution $G(3+4, 1/(1+1/1)) = G(7, 0.5)$.
Let’s plot these distributions in R. As an illustration, we also plot the posterior distribution obtained by Monte Carlo sampling. Conjugate priors allow posterior distributions to be obtained without numerical integration.
In [ ]:
## Gamma posterior
alpha <- 3
beta <- 1
theta <- seq(0, 20, 0.1)
prior <- dgamma(x=theta, shape=alpha, scale=beta)
y <- 4
posterior <- dgamma(x=theta, shape=y+alpha, scale=1/(1+1/beta))
plot(x=theta, y=posterior, xlab=expression(theta), ylab="Density", type="l")
lines(theta, prior, lty=3)
postdraw <- rgamma(n=1e5, shape=y+alpha, scale=1/(1+1/beta))
histdraw <- hist(postdraw, breaks=20, plot=F)
lines(histdraw, lty=3, col="grey", freq=F)
Hierarchical modelling¶
There are many other ways to build a prior distribution. As we have seen so far, a posterior distribution is typically obtained with two stages: one for $f({y}|{\theta})$, the likelihood of the data, and one for $\pi({\theta}|{\nu})$, the prior distribution of ${\theta}$ given a vector of hyperparameters ${\nu}$. Note that, for simplicity, we drop the notation indicating that all these parameters may be vectors.
Let's suppose that we are uncertain about the values of ${\nu}$. In this case, we need an additional stage, a hyperprior, defining the density distribution of the hyperparameters. If we denote this distribution as $h({\nu})$, then the posterior distribution is \begin{equation} P({\theta}|{y}) = \frac{ \int f({y}|{\theta})\pi({\theta}|{\nu}) h({\nu})d{\nu} }{ \int \int f({y}|{\theta})\pi({\theta}| {\nu})h({\nu})d{\nu}d{\theta} } \end{equation}
Another possibility is to replace ${\nu}$ with an estimate $\hat{{\nu}}$ obtained by maximising the marginal distribution $m({y}|{\nu})$. In this way, our inferences are made on the estimated posterior $P({\theta}|{y},\hat{{\nu}})$, obtained by inserting $\hat{{\nu}}$ into Bayes' theorem. This approach is called Empirical Bayesian analysis, as we are using the data to estimate the hyperparameters. The empirical estimation of the prior may seem a violation of Bayesian principles: updating the prior based on the data uses the data twice, both for the likelihood and for the prior. Inferences from such models tend to be "overconfident", and methods that ignore this fact are called naive Empirical Bayesian approaches.
Furthermore, we can again think of ${\nu}$ as depending on a collection of unknown parameters ${\lambda}$, with $h({\nu}|{\lambda})$ and a third-stage prior $g({\lambda})$. This procedure of specifying a model over several layers is called hierarchical modelling. This framework is very much used in graphical modelling (e.g. Bayesian networks). As we add extra layers and levels of randomness, subtle changes at the top levels (hyperpriors) will not have a strong effect at the bottom level (the data).
Noninformative priors¶
It is often the case that no reliable prior information on ${\theta}$ is available. Can we still employ a Bayesian approach under such circumstances? Yes, provided that we can find a distribution $\pi({\theta})$ that contains "no information" about ${\theta}$, in the sense that it does not favour one value over another. We refer to such a distribution as a noninformative prior for ${\theta}$. All the information in the posterior will then arise from the data.
Suppose that the parameter space is discrete and finite, that is, $\vec{\Theta}=\{\theta_1, \theta_2, ..., \theta_n\}$. What is a possible noninformative prior? One possibility is \begin{equation} p(\theta_i) = \frac{1}{n} \text{, } i=1, 2, ...,n \end{equation} This prior distribution does not favour any value and is therefore noninformative about $\theta$.
Furthermore, note that even in this case the prior distribution must be a proper (legitimate) probability distribution, meaning that \begin{equation} \sum_{i=1}^n \frac{1}{n}=1 \end{equation} If $\vec{\Theta}$ is continuous and bounded, say $\vec{\Theta}=[a,b]$ with $-\infty < a < b < +\infty$, then a uniform prior \begin{equation} \pi(\theta) = \frac{1}{b-a} \text{, } a \leq \theta \leq b \end{equation} is a proper noninformative choice. How about the case of $\vec{\Theta}$ being continuous and unbounded, so that $\vec{\Theta}=(-\infty,+\infty)$? A noninformative prior could be set as \begin{equation} P(\theta) = c \text{, any } c>0 \end{equation} However, this distribution is clearly improper, as \begin{equation} \int_{-\infty}^{+\infty} p(\theta) d\theta = +\infty \end{equation} This may suggest that in this case a Bayesian approach cannot be used. However, Bayesian inference is still possible if the integral of the likelihood $f({y}|\theta)$ with respect to $\theta$ equals some finite value $K$, since \begin{equation} \int \frac{f({y}|\theta) \cdot c}{\int f({y}|\theta) \cdot c \, d\theta} d\theta = 1 \end{equation}
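A quick grid-based sketch in R of this last point: with a constant prior, the normalised posterior coincides with the normalised likelihood (here for a Poisson likelihood with $y=4$, as in the elephant example):
In [ ]:
# with a flat prior the posterior is just the normalised likelihood
theta <- seq(0.01, 20, 0.01)
y <- 4
likelihood <- dpois(x=y, lambda=theta)
prior <- rep(1, length(theta)) # constant (improper) prior over the grid
posterior <- likelihood * prior / sum(likelihood * prior * 0.01)
# compare with the likelihood normalised to integrate to 1: the difference is (numerically) zero
max(abs(posterior - likelihood / sum(likelihood * 0.01)))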
Let’s go back to our example on herds of elephants. Suppose that we don’t have information on past rates of arrival and therefore we cannot (or don’t want to) use the Gamma distribution as the prior, as we did previously. Instead, we use a uniform prior $U$ for the mean arrival rate $\theta$, our parameter. What is the interval of $\theta$? Theoretically, it should be $[0,+\infty)$. However, let’s bound it at a large number and define a uniform prior distribution $U(0,100)$. Let’s plot this distribution in R.
In [ ]:
## Uniform distribution
theta <- seq(0, 100, 0.1)
prior <- dunif(x=theta, min=0, max=100)
plot(x=theta, y=prior, xlab=expression(theta), ylab="Density", type="l")
This choice rules out values of $\theta$ that are impossible in real life, which means that the posterior will be truncated at 0 and 100. As we lack a conjugate model, we can sample from the posterior to obtain its distribution. We will see later how we can do this using computational methods. Noninformative priors are related to the notion of reference priors. These are not necessarily noninformative, but rather convenient, default choices for prior distributions.
Bayesian inference¶
Once we have specified the prior, we can use Bayes' theorem to obtain the posterior distribution of the model parameters. However, the density (or cumulative distribution) function can be difficult to interpret, and we therefore want to summarise the information it encloses. Here we discuss how to develop Bayesian techniques for point estimation, interval estimation, and hypothesis testing.
Point estimation¶
Let's first consider the univariate case. We want to select a summary feature of $p(\theta|{y})$ to obtain a point estimate $\hat{\theta}({y})$, which could be its mean, median, or mode. These features may behave very differently depending on the distribution, especially when it is asymmetric and one-tailed. Recalling the example of elephant herds, assuming $y=1$ and that our prior is $G(0.5,1)$, let's calculate and plot the posterior distribution.
In [ ]:
## Gamma posterior, asymmetric one-tailed prior
alpha <- 0.5
beta <- 1
theta <- seq(0, 20, 0.1)
prior <- dgamma(x=theta, shape=alpha, scale=beta)
y <- 1
posterior <- dgamma(x=theta, shape=y+alpha, scale=1/(1+1/beta))
plot(x=theta, y=posterior, xlab=expression(theta), ylab="Density", type="l")
lines(theta, prior, lty=3)
What are the mean, mode and median of the resulting posterior distribution? Generally speaking, the mode is the easiest to calculate: since no normalisation is required, we can work directly with the numerator of Bayes' theorem. Note that if the prior distribution is flat, then the posterior mode equals the maximum likelihood estimate of $\theta$; in this case, it is called the generalised maximum likelihood estimate of $\theta$. If the posterior distribution is symmetric, then the mean and the median are equivalent, and for symmetric unimodal distributions all three features coincide. For asymmetric distributions, the median is often the best choice, as it is less affected by outliers and is intermediate between the mode and the mean.
If we want a measure of the accuracy of a point estimate $\hat{\theta}({y})$, we can calculate the posterior variance \begin{equation} \mathrm{Var}_{\theta|y}(\theta) = \mathrm{E}_{\theta|y}\left[\left(\theta-\mathrm{E}_{\theta|y}(\theta)\right)^2\right] \end{equation} The posterior mean minimises the posterior variance with respect to $\hat{\theta}({y})$. In the multivariate case, we can obtain the posterior mode as $\hat{\vec{\theta}}({y})=(\hat{\theta_1},\hat{\theta_2},...,\hat{\theta_k})$. If the mode exists, maximisation methods (e.g. grid search, golden section search, Newton-type methods, ...) are typically employed to locate it. We can also calculate the posterior mean which, again, minimises the posterior covariance matrix with respect to $\hat{\vec{\theta}}({y})$.
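For the posterior above, which is $G(y+\alpha, 1/(1+1/\beta)) = G(1.5, 0.5)$, the three point estimates can be obtained from known properties of the Gamma distribution; a minimal sketch:
In [ ]:
# point estimates for the Gamma posterior G(shape=1.5, scale=0.5)
shape <- 1.5
scale <- 0.5
postMean <- shape * scale # E(G(alpha,beta)) = alpha*beta
postMedian <- qgamma(p=0.5, shape=shape, scale=scale) # no closed form, computed numerically
postMode <- (shape - 1) * scale # valid for shape >= 1
c(mean=postMean, median=postMedian, mode=postMode)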
Credible intervals¶
The Bayesian analogue of a frequentist confidence interval is called a credible set. A $100 \times (1-\alpha)\%$ credible set for ${\theta}$ is a subset $C$ of ${\Theta}$ such that \begin{equation} 1-\alpha \leq P(C|{y}) = \int_C p({\theta}|{y})d{\theta} \end{equation} in the continuous case; in the discrete case, the integral is replaced by a summation. This definition enables direct probability statements about $\theta$ falling in $C$: “the probability that $\theta$ lies in $C$ given the observed data $y$ is at least $(1-\alpha)$”. Unlike in the frequentist case, the credible set provides an actual probability statement, based on both the observed data and prior opinion. In discrete settings it may not be possible to find the exact coverage probability $(1-\alpha)$. In continuous settings we can calculate the highest posterior density, or HPD, credible set, defined as \begin{equation} C = \{ \theta \in \Theta : p(\theta|y) \geq k(\alpha) \} \end{equation} where $k(\alpha)$ is the largest constant satisfying $P(C|y)\geq(1-\alpha)$.
Let's recall the example of elephant herds and assume that the posterior distribution is $p(\theta|{y}) \sim G(2,1)$.
In [ ]:
## Interval estimation
theta <- seq(0, 10, 0.05)
alpha <- 2
beta <- 1
posterior <- dgamma(x=theta, shape=alpha, scale=beta)
plot(x=theta, y=posterior, xlab=expression(theta), ylab="Posterior density", type="l")
There are several ways to define a credible interval. For the HPD, we choose the narrowest interval which, for a unimodal distribution, involves choosing the values of highest probability density; this interval includes the mode. For instance, drawing a line at $k(\alpha)=0.1$ results in an $87\%$ HPD.
In [ ]:
plot(x=theta, y=posterior, xlab=expression(theta), ylab="Posterior density", type="l")
abline(h=0.1, lty=3)
We can calculate the interval of values of $\theta$ included in the $87\%$ HPD.
In [ ]:
# install.packages("coda") # uncomment if you don't have the "coda" R package already installed
library(coda)
x <- rgamma(n=1e5, shape=alpha, scale=beta)
plot(x=theta, y=posterior, xlab=expression(theta), ylab="Posterior density", type="l")
hpd <- HPDinterval(as.mcmc(x), prob=0.87)
hpd
abline(v=hpd, lty=1)
Another common strategy for obtaining credible intervals is to choose the interval where the probability of being below it is as likely as being above it. For instance, if $a=1-0.87$, for an $87\%$ credible interval, the equal-tailed interval corresponds to the $\{a/2, 1-a/2\}$ quantiles of the distribution. This interval includes the median and, if the distribution is symmetric, both credible intervals coincide. We can easily calculate such quantiles in R.
In [ ]:
a <- 1-0.87
eqi <- quantile(x, probs=c( a/2, 1-(a/2) ))
plot(x=theta, y=posterior, xlab=expression(theta), ylab="Posterior density", type="l")
abline(v=eqi, lty=2)
How can we summarise our results? Typically, we can report:
• the posterior mean,
• several posterior percentiles (e.g. 0.025, 0.25, 0.50, 0.75, 0.975),
• a credible interval,
• posterior probabilities $p(\theta>c|y)$, where $c$ is a notable point (e.g. 0, 1, depending on the problem),
• a plot of the distribution to check whether it is unimodal, multimodal, skewed, and so on (a minimal sketch follows).
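As a minimal sketch (for the $G(2,1)$ posterior used above), all of these summaries can be obtained from posterior samples; the choice $c=1$ for the notable point is arbitrary here:
In [ ]:
# summarising a posterior from samples (Gamma(2,1) example)
library(coda)
x <- rgamma(n=1e5, shape=2, scale=1)
mean(x)                                         # posterior mean
quantile(x, probs=c(0.025,0.25,0.5,0.75,0.975)) # notable percentiles
HPDinterval(as.mcmc(x), prob=0.95)              # a credible interval
mean(x > 1)                                     # posterior probability p(theta>1|y)
hist(x, breaks=50)                              # shape: unimodal? skewed?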
Hypothesis testing¶
The frequentist approach to comparing predictions made by alternative scientific explanations is based on the classic ideas of Fisher, Neyman and Pearson. Typically, one formulates a null hypothesis $H_0$ and an alternative hypothesis $H_a$. Then an appropriate test statistic $T({Y})$ is chosen. Finally, one computes the observed significance, or p-value, of the test as the chance that $T({Y})$ is "more extreme" than $T(y_{obs})$, where "extremeness" is in the direction of the alternative hypothesis. If the p-value is less than some threshold, typically in the form of a pre-specified Type I error rate, $H_0$ is rejected; otherwise it is not.
While widely used, this approach attracts a few criticisms:
• it is applied only when two hypotheses are nested, one within the other; typically, $H_0$ is a simplification of $H_a$ and involves setting one parameter of $H_a$ to some known constant value;
• it offers evidence against the null hypothesis; a large p-value does not mean that the two models are equivalent, but only that we lack evidence of the contrary; we don’t “accept the null hypothesis” but “fail to reject it”;
• a p-value does not offer a direct interpretation in terms of weight of evidence but only as a long-term probability (a p-value is not the probability that $H_0$ is true!).
The Bayesian approach to hypothesis testing is simpler and more intuitive. Basically, one calculates the posterior probability that the first hypothesis is correct. In general, one can test as many models as desired, $M_i, i=1,…,m$.
Suppose we have two models $M_1$ and $M_2$ for data $Y$, with parameters ${\theta}_1$ and ${\theta}_2$ respectively. With prior densities $\pi_i({\theta}_i)$, $i=1,2$, the marginal distributions of $Y$ are \begin{equation} P(y|M_i) = \int f(y|\theta_i,M_i) \pi_i(\theta_i) d\theta_i \end{equation} Then Bayes' theorem can be used to calculate the posterior probabilities $P(M_1|y)$ and $P(M_2|y)=1-P(M_1|y)$ for the two models.
A Bayes factor (BF) is used to summarise these results, and it is equal to the ratio of posterior odds of $M_1$ to the prior odds of $M_1$ \begin{equation} BF = \frac{P(M_1|y)/P(M_2|y)}{P(M_1)/P(M_2)}=\frac{p(y|M_1)}{p(y|M_2)} \end{equation} This equation turns out to be the ratio of observed marginal densities for the two models (note that $P(M_i)$ is the prior probability). If the two models are a priori equally probable, then \begin{equation} BF = p(M_1|y) / p(M_2|y) \end{equation} which is the posterior odds of $M_1$.
The interpretation of BF is that it captures the change in the odds in favour of model 1 as we move from the prior to the posterior, as summarised in the following table.
BF          Strength of evidence
1 to 3      not worth more than a bare mention
3 to 20     positive
20 to 150   strong
>150        very strong
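As a hedged illustration with hypothetical data ($k=40$ successes in $n=200$ trials), the Bayes factor between two Binomial models that differ only in their Beta priors can be computed from the closed-form marginal likelihoods $m_i = \binom{n}{k} B(k+\alpha_i, n-k+\beta_i)/B(\alpha_i,\beta_i)$:
In [ ]:
# hypothetical data: k successes out of n trials
k <- 40; n <- 200
# marginal likelihood of a Binomial observation under a Beta(a,b) prior
marginal <- function(k, n, a, b) choose(n, k) * beta(k+a, n-k+b) / beta(a, b)
m1 <- marginal(k, n, 1, 1)     # M1: flat Beta(1,1) prior
m2 <- marginal(k, n, 0.5, 0.5) # M2: Beta(0.5,0.5) prior
m1/m2                          # Bayes factor of M1 versus M2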

Practical: quantifying population variation¶
We now suppose that we have sequenced several bear genomes and, using the method in the previous practical, assigned a genotype to each individual. We now address a further question: what is the frequency of a certain allele at the population level? Be aware that we have only a sample of the entire population of bears, but we want to make inferences at the whole-population level.

Our sample contains information for 100 individuals with the following genotypes: 63 AA, 34 AG, 3 GG. A frequentist estimate of the frequency of G is given by: $(34+(3*2))/200=40/200=0.20$. What is the posterior distribution for the population frequency of G?
The first thing we need to do is define our likelihood model. We can imagine randomly sampling one allele from the population, where each time the allele is either G or not. This is a series of Bernoulli trials, so we can use a Binomial distribution as our likelihood function.
The Binomial likelihood is \begin{equation} P(k|p,n) = \binom{n}{k} p^k(1-p)^{n-k} \end{equation} where $k$ is the number of successes (i.e. the event of sampling a G allele), $p$ is the proportion of $G$ alleles we have (i.e. the probability of a success), and $n$ is the number of alleles we sample. Recall that \begin{equation} \binom{n}{k} = \frac{n!}{k!(n-k)!} \end{equation} Note that the combinatorial term does not contain $p$.
What is the maximum likelihood estimate of $p$? You may recall that it is $\hat{p}=\frac{k}{n}$. Note that the combinatorial term does not affect this estimate.
The second thing we need to do is define a prior probability for $p$. What is the interval of values that $p$ can take? It is $[0,1]$, as we express frequencies relative to the whole population/sample. It is convenient to choose a prior distribution which is conjugate to the Binomial. A Beta distribution is a conjugate prior in this case.
Are certain values of $p$ more likely to occur without observing the data? If we assume that it is not the case, can we use the Beta distribution to generate a noninformative prior? We can choose $Beta(\alpha=1,\beta=1)$, which is defined as \begin{equation} P(p) = \frac{1}{B(\alpha,\beta)} p^{\alpha-1}(1-p)^{\beta-1} \end{equation} where $\frac{1}{B(\alpha,\beta)}$ is simply a normalisation term which does not depend on $p$.
The full model can be expressed as $P(p|k,n) \propto P(k|p,n)P(p)$. What is the closed form of the posterior distribution given our choices for the likelihood and prior functions? It is \begin{equation} P(p|k,n) \propto p^{k+\alpha-1}(1-p)^{n-k+\beta-1} \end{equation} The posterior distribution (Beta-Binomial model) is a Beta distribution with parameters $k+\alpha$ and $n-k+\beta$. If we set $\alpha=\beta=1$ then $P(p|k,n)=Beta(k+1,n-k+1)$. What are $k$ and $n$ in this example? $n$ is the number of alleles we sample and $k$ is the occurrence of the G allele in our sample.
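A minimal sketch of this closed form for our data (part A below asks you to extend it with point estimates and intervals):
In [ ]:
# Beta-Binomial posterior for the bear data: k=40 G alleles out of n=200
k <- 40
n <- 200
alpha <- 1; beta <- 1 # flat Beta(1,1) prior
p <- seq(0, 1, 0.001)
posterior <- dbeta(p, shape1=k+alpha, shape2=n-k+beta)
plot(x=p, y=posterior, xlab="Population frequency of G", ylab="Posterior density", type="l")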
A)¶
Plot this posterior probability in R. Then calculate the maximum a posteriori value, $95\%$ credible intervals, and notable quantiles. What happens to the distribution if we have only 10 samples (with the same sample allele frequency of 0.20)?
Now let’s think of a more informative prior. The genome-wide distribution of allele frequencies for human populations has a particular shape. This is called the site frequency spectrum (SFS) or allele frequency spectrum (AFS).

We can take another look at it by plotting the minor allele counts (MAC) distribution.

Does this distribution fit a uniform prior? Can we use a conjugate (Beta) function to model this distribution? For instance, choosing $\alpha=0.5$ and $\beta=2$ will put more weight on low-frequency variants. However, we don’t know a priori whether the allele we are interested in is the minor allele. Therefore a prior distribution with more density at both low and high frequencies might be more appropriate. For instance, this can be achieved by setting $\alpha=0.1$ and $\beta=0.1$, as illustrated below.
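A quick sketch comparing these candidate Beta priors (the ylim value is arbitrary, since $Beta(0.1,0.1)$ diverges at the boundaries):
In [ ]:
# comparing candidate Beta priors for the allele frequency
p <- seq(0.001, 0.999, 0.001)
plot(p, dbeta(p, 0.5, 2), type="l", xlab="p", ylab="Prior density", ylim=c(0,5))  # weight on low frequencies
lines(p, dbeta(p, 0.1, 0.1), lty=2)  # density at both low and high frequencies
lines(p, dbeta(p, 1, 1), lty=3)      # uniform
legend("top", legend=c("Beta(0.5,2)","Beta(0.1,0.1)","Beta(1,1)"), lty=1:3)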
B)¶
Recalculate the posterior distribution of $p$ using an informative prior (make your own choices regarding the parameters of the Beta distribution), both in the case of 100 and of 10 samples. Discuss how these results compare to the ones obtained in point A.

Bayesian computation¶
As previously discussed, the calculation of posterior distributions often involves the evaluation of complex high-dimensional integrals (e.g. the denominator of Bayes’ formula). This is particularly true when a conjugate prior is not available or appropriate. The two ways of addressing this issue are through (i) asymptotic methods for approximating the posterior density and (ii) numerical integration.
Asymptotic methods¶
When there are many data points, the likelihood will be peaked and the posterior will be little affected by the prior. In this situation, small changes in the prior have little effect on the posterior, and the likelihood is concentrated in a small region of the parameter space.
Let’s recall again the Beta-Binomial model of population allele frequencies. If we have more data (sampled alleles), the posterior (and the likelihood) becomes narrower and more peaked. This can be easily visualised using R.
In [ ]:
# Beta-Binomial model of population allele frequencies
p <- seq(0, 1, 0.01)
# 1000 chromosomes (200 G alleles)
k <- 200
n <- 1000
alpha <- k+1
beta <- n-k+1
plot(x=p, y=dbeta(p, shape1=alpha, shape2=beta), ylab="Posterior density", xlab="Population frequency of G", type="l")
# 100 chromosomes (20 G alleles)
k <- 20
n <- 100
alpha <- k+1
beta <- n-k+1
points(x=p, y=dbeta(p, shape1=alpha, shape2=beta), type="l", lty=2)
# 10 chromosomes (2 G alleles)
k <- 2
n <- 10
alpha <- k+1
beta <- n-k+1
points(x=p, y=dbeta(p, shape1=alpha, shape2=beta), type="l", lty=3)
legend("topright", legend=c(1e3,1e2,1e1), lty=1:3)
In a scenario where many data points are available, $P(\theta|y)$ will be approximately Normally distributed. This is given by the Bayesian Central Limit Theorem. More formally, for a large number of data points $n$ the posterior can be approximated by a Normal distribution with mean equal to the posterior mode and (co)variance (matrix) equal to minus the inverse of the second derivative matrix of the log posterior evaluated at the mode.
For instance, recalling our Beta-Binomial model for population allele frequencies, if we use a flat Beta prior we have $P(\theta|x) \propto \theta^x(1-\theta)^{n-x}$. The Normal approximation follows this procedure:
• take the log: $l(\theta)=x\log\theta+(n-x)\log(1-\theta)$,
• take the derivative of $l(\theta)$ and set it to zero, obtaining $\hat{\theta}^\pi=\frac{x}{n}$,
• take the second derivative evaluated at $\hat{\theta}$, $-\frac{n}{\hat{\theta}}-\frac{n}{1-\hat{\theta}}$,
• take the minus inverse, $\frac{\hat{\theta}(1-\hat{\theta})}{n}$,
• so $P(\theta|x)$ is approximately $N(\hat{\theta}^\pi, \frac{\hat{\theta}(1-\hat{\theta})}{n})$.
Remember that the derivative of $\log(x)$ is $1/x$ and that the derivative of $\log(1-x)$ is $1/(x-1)$.
Let's plot the exact posterior distribution and the Normal approximation.
In [ ]:
# Beta-Binomial model of population allele frequencies
p <- seq(0, 1, 0.01)
# 100 chromosomes (20 G alleles)
k <- 20
n <- 100
# exact posterior with flat prior
alpha <- k+1
beta <- n-k+1
plot(x=p, y=dbeta(p, shape1=alpha, shape2=beta), ylab="Density", xlab="Population frequency", type="l")
# Normal approximation
thetaHat <- k/n
var <- thetaHat*(1-thetaHat)/n
points(x=p, y=dnorm(p, mean=thetaHat, sd=sqrt(var)), type="l", lty=2)
legend("topright", c("exact","approx"), lty=1:2)
You can see how the modes are very similar, but the approximated curve fails to capture the asymmetry of the tails.
Other Normal approximations are used. If the prior is flat, we can replace the posterior mode by the MLE; alternatively, we can replace the posterior mode by the posterior mean. These approximations are called modal, or first order, approximations, as they estimate $\theta$ by the mode and the error goes to $0$ at a rate proportional to $1/n$. Estimates of posterior moments and quantiles can be obtained simply as the corresponding features of the approximated Normal density. However, these estimates may be poor if the posterior differs markedly from normality. Laplace's Method provides a second order approximation to the posterior mean, with an error that decreases at a rate $1/n^2$.
In the previous example, we can easily compare the exact and first order approximated quantiles in R.
In [ ]:
# data
k <- 20
n <- 100
# exact posterior with flat prior
alpha <- k+1
beta <- n-k+1
exact <- rbeta(1e5, shape1=alpha, shape2=beta)
signif(quantile(exact), 3)
# Normal approximation
thetaHat <- k/n
var <- thetaHat*(1-thetaHat)/n
approx <- rnorm(1e5, mean=thetaHat, sd=sqrt(var))
signif(quantile(approx), 3)
The advantages of asymptotic methods are:
• they replace numerical integration with numerical differentiation,
• they are deterministic (without elements of stochasticity),
• they reduce the computational complexity of any study of robustness (how sensitive are our conclusions to changes in the prior/likelihood?).
Asymptotic methods also have disadvantages:
• they require that the posterior is unimodal,
• they require that the size of the data is large (how large is "large enough"?),
• for high-dimensional parameters the calculation of Hessian matrices (second derivatives) is hard.
Researchers now use iterative methods based on Monte Carlo sampling, which take longer to run but are more general and relatively easy to implement.
Non-iterative Monte Carlo methods¶
Direct sampling from the posterior density can be done using Monte Carlo integration. Suppose that ${\theta} \sim h({\theta})$, with $h({\theta})$ being a posterior distribution; $\theta$ can again be a vector of parameters. We aim to estimate $\gamma$, the posterior mean of $c({\theta})$, with $\gamma \equiv \mathrm{E}[c({\theta})] = \int c({\theta}) h({\theta}) d{\theta}$.
If ${\theta}_1, {\theta}_2, ..., {\theta}_N$ are independent and identically distributed (iid) as $h({\theta})$, then \begin{equation} \hat{\gamma}=\frac{1}{N}\sum_{i=1}^N c(\theta_i) \end{equation} which converges to $\mathrm{E}[c({\theta})]$ with probability 1 as $N \rightarrow \infty$. The computation of posterior expectations requires only a sample of size $N$ from the posterior distribution. In contrast to asymptotic methods, accuracy improves with $N$, the Monte Carlo sample size (which we can choose and have control over), rather than with $n$, the size of the data set (which we may not be able to control). With higher dimensionality of ${\theta}$ more samples are needed, but the structure of the estimator remains the same.
The variance of $\hat{\gamma}$ can be estimated from the sample variance of the $c({\theta}_i)$ values. The standard error estimate for $\hat{\gamma}$ is \begin{equation} \hat{se}(\hat{\gamma}) = \sqrt{ \frac{1}{N(N-1)} \sum_{i=1}^N [c({\theta_i})-\hat{\gamma}]^2 } \end{equation} The Central Limit Theorem implies that $\hat{\gamma} \pm 2 \hat{se}(\hat{\gamma})$ provides an approximate $95\%$ confidence interval, and $N$ can be chosen as large as necessary to make this interval acceptably narrow.
In the univariate case, a histogram of the sampled $\theta_i$ estimates the posterior itself, as the probability of each bin converges to the true bin probability. Indeed, an estimate of $p \equiv P\{a < \theta < b\}$ is given by the proportion of sampled values falling in the interval $(a,b)$.
What happens if we can't directly sample from this distribution? There are methods for indirect sampling of the posterior distribution. The most commonly used ones are (i) importance sampling, (ii) rejection sampling, and (iii) the weighted bootstrap. In importance sampling, an importance function is derived to approximate the normalised likelihood times the prior, and a weight function is then used for sampling, as sketched below. In the weighted bootstrap, instead of resampling from the set $\{\theta_1,...,\theta_N\}$ with equal probabilities, some points are sampled more often than others because of unequal weighting.
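As a minimal sketch of importance sampling, assuming the Beta-Binomial target from earlier with a Uniform importance density (so the weights are simply proportional to the unnormalised posterior):
In [ ]:
# self-normalised importance sampling for the posterior mean
# target: unnormalised posterior h(theta) = theta^k (1-theta)^(n-k), flat prior
k <- 20; n <- 100
N <- 1e5
theta <- runif(N)               # draws from the importance density g = Uniform(0,1)
w <- theta^k * (1-theta)^(n-k)  # weights proportional to h(theta)/g(theta)
sum(w*theta)/sum(w)             # approximates the exact posterior mean (k+1)/(n+2)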
In rejection sampling, a smooth density called the envelope function is used to "cover" rather than approximate the posterior distribution. Suppose we can identify an envelope function $g({\theta})$ and a constant $M>0$ such that $L({\theta})\pi({\theta}) < Mg({\theta})$ for all ${\theta}$. The algorithm is as follows:
• generate $\theta_i \sim g(\theta)$,
• generate $U \sim Uniform(0,1)$,
• if $MUg(\theta_i) < L(\theta_i)\pi(\theta_i)$, accept $\theta_i$; otherwise reject it,
• repeat this procedure until $N$ samples are obtained. The members of this sample will be random variables from $h(\theta)$.
The intuition for the rejection sampling algorithm is given in the figure below. The idea is that it is typically hard to sample from the true posterior but much easier to sample from the envelope function.

Let’s make a simple example in R to approximate a Beta distribution using a uniform envelope function.
In [ ]:
# calculate density from true posterior
calcTrueDensity <- function(x) dbeta(x, 3, 10)
# g is a uniform envelope density, M is the maximum density value of the posterior (if known)
x <- seq(0, 1, 0.01)
epsilon <- 1e-3
M <- max(calcTrueDensity(x)) + epsilon
thetas <- c()
# we want N samples
N <- 1e4
rawDensity <- rbeta(N, 3, 10)
while (length(thetas) < N) {
	theta_j <- runif(1, 0, 1)
	U <- runif(1, 0, 1)
	if (M*U < calcTrueDensity(theta_j)) thetas <- c(thetas, theta_j)
}
# check with a qq-plot against direct draws from the Beta
qqplot(rawDensity, thetas)
abline(0,1)
Markov chain Monte Carlo methods¶
All the previous methods are non-iterative, as they draw a sample of fixed size $N$. There is no notion of "convergence" in these methods; rather, we require $N$ to be sufficiently large. For many problems with high dimensionality it may be difficult to find, for instance, an importance sampling density that is an acceptable approximation to the (log) posterior distribution. In these cases, it is standard practice to use Markov chain Monte Carlo (MCMC) methods. The rationale is that these methods sequentially sample parameter values from a Markov chain whose stationary distribution is the desired posterior distribution.
A Markov process is a stochastic (random) process, usually defined as a collection of random variables, with the property that the next value of the process depends only on the current value and is independent of the previous ones. In other words, the future value depends only on the current state. A Markov chain is a Markov process with a particular type of state space, which dictates the possible values that the stochastic process can take.
The figure below depicts a diagram of a two-state Markov process, with the states labelled E and A. Each number represents the probability of the Markov process changing from one state to another, with the direction indicated by the arrow.
The great increase in generality of MCMC methods comes at the price of requiring an assessment of convergence of the Markov chain to its stationary distribution, that is, the probability distribution to which the process converges after a large number of steps, or iterations. Convergence is usually assessed using plots or numerical summaries of the sampled distribution from the chain.
The majority of Bayesian MCMC computation is based on two algorithms: the Gibbs sampler and the Metropolis-Hastings (M-H) algorithm.
Gibbs sampler¶
Let's suppose that our model has $k$ parameters, so that $\vec{\theta}=(\theta_1, \theta_2, ..., \theta_k)$. We also assume that we can sample from the full conditional distributions. The collection of full conditional distributions uniquely determines the joint posterior distribution $P(\vec{\theta}|{y})$ and all marginal posterior distributions $P(\theta_i|{y})$, for $i=1,...,k$.
Given an arbitrary set of starting values $\{\theta_2^{(0)}, ..., \theta_k^{(0)}\}$, for $t=1, ..., T$ the algorithm:
• draws $\theta_1^{(t)}$ from $P(\theta_1 | \theta_2^{(t-1)}, \theta_3^{(t-1)}, ..., \theta_k^{(t-1)}, y )$
• draws $\theta_2^{(t)}$ from $P(\theta_2 | \theta_1^{(t)}, \theta_3^{(t-1)}, ..., \theta_k^{(t-1)}, y )$
• ...
• draws $\theta_k^{(t)}$ from $P(\theta_k | \theta_1^{(t)}, \theta_2^{(t)}, ..., \theta_{k-1}^{(t)}, y )$
Under most conditions, $(\theta_1^{(t)}, \theta_2^{(t)}, ..., \theta_k^{(t)})$ converges to a draw from the true joint posterior distribution $P(\theta_1, \theta_2, ..., \theta_k | y)$.
This implies that for sufficiently large $t>t_0$ then $\{\theta^{(t)}, t=t_0+1, …, T\}$ is a correlated sample from the true posterior distribution.
A histogram of $\{\theta_i^{(t)}, t=t_0+1, ..., T\}$ provides an estimate of the marginal posterior distribution for $\theta_i$. The posterior mean can be estimated as \begin{equation} \hat{\mathrm{E}}(\theta_i|y) = \frac{1}{T-t_0} \sum_{t=t_0+1}^T \theta_i^{(t)} \end{equation} The time $0 \leq t \leq t_0$ is called the burn-in period.
To speed up the computation, we can run $m$ chains (typically $m \leq 5$, with chains run in parallel). This procedure is also useful for assessing convergence. In this case, the posterior mean is estimated as \begin{equation} \hat{\mathrm{E}}(\theta_i|y) = \frac{1}{m(T-t_0)} \sum_{j=1}^m \sum_{t=t_0+1}^T \theta_{i,j}^{(t)} \end{equation} The entire marginal posterior density of $\theta_i$ can be estimated as \begin{equation} \hat{p}(\theta_i|y) \approx \frac{1}{m(T-t_0)} \sum_{j=1}^m \sum_{t=t_0+1}^T P(\theta_{i} | \theta_{l \neq i,j}^{(t)}, y) \end{equation}
A minimal requirement for convergence is that the parameter space must be fully connected, without "holes". Imagine a joint posterior distribution for two univariate parameters $\theta$ and $\nu$ with two disconnected regions of support: the posterior is defined either for $(\theta>0 \text{ and } \nu>0)$ or for $(\theta<0 \text{ and } \nu<0)$. If we choose $\theta^{(0)}>0$ then we will have $\nu^{(1)}>0$, subsequently $\theta^{(1)}>0$, and so on. The chain will never be able to “escape” the first quadrant.
An additional issue appears when $\theta$ and $\nu$ are highly correlated, as this leads to autocorrelation along the chain. The chain will therefore mix slowly and might get trapped in one part of the joint distribution. A solution is thinning, which relies on retaining only every $m^{th}$ iteration; if $m$ is large enough, the retained samples will be approximately uncorrelated. A sketch of a Gibbs sampler, including burn-in and thinning, is given below.
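As a minimal, self-contained sketch (not part of the original notes), consider a bivariate Normal "posterior" with unit variances and known correlation rho, whose full conditionals are Normal and therefore easy to sample from:
In [ ]:
# Gibbs sampler for a bivariate Normal with correlation rho
rho <- 0.8
niter <- 1e4; burnin <- 1e3
theta <- matrix(NA, nrow=niter, ncol=2)
theta[1,] <- c(-5, 5)  # arbitrary (overdispersed) starting values
for (t in 2:niter) {
	# full conditional of theta1 given theta2: N(rho*theta2, 1-rho^2)
	theta[t,1] <- rnorm(1, mean=rho*theta[t-1,2], sd=sqrt(1-rho^2))
	theta[t,2] <- rnorm(1, mean=rho*theta[t,1],   sd=sqrt(1-rho^2))
}
keep <- seq(burnin+1, niter, by=5)  # discard burn-in and thin every 5th iteration
hist(theta[keep,1], breaks=50)      # estimated marginal posterior of theta1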
Metropolis-Hastings algorithm¶
To ensure that all the full conditional distributions are available, the prior distribution of each parameter can be chosen to be conjugate to the corresponding likelihood. When the priors and likelihoods are not conjugate pairs, one or more of these conditional distributions may not be available in closed form. Nevertheless, even under these conditions the full conditional distributions will be available up to a proportionality constant.
The Metropolis algorithm is a rejection algorithm which requires only a function proportional to the distribution to be sampled, at the cost of a rejection step from a candidate density function. With this algorithm we can generate samples from a joint posterior distribution $h(\vec{\theta})$ such that $P(\vec{\theta}|y) \propto h(\vec{\theta}) \equiv f(y|\vec{\theta})\pi(\vec{\theta})$.
The algorithm begins by proposing a candidate, or proposal, symmetric density $q(\vec{\theta}^* | \vec{\theta}^{(t-1)})$ which satisfies $q(\vec{\theta}^* | \vec{\theta}^{(t-1)})=q(\vec{\theta}^{(t-1)} | \vec{\theta}^*)$. From a starting value $\vec{\theta}^{(0)}$ at iteration $t=0$, for $t=1,..., T$ the algorithm repeats:
• draw $\vec{\theta}^* \sim q( \cdot | \vec{\theta}^{(t-1)})$,
• calculate $r=h(\vec{\theta}^*)/h(\vec{\theta}^{(t-1)})$,
• if $r \geq 1$, set $\vec{\theta}^{(t)}=\vec{\theta}^*$; otherwise set $\vec{\theta}^{(t)}=\vec{\theta}^*$ with probability $r$ or set $\vec{\theta}^{(t)}=\vec{\theta}^{(t-1)}$ with probability $1-r$.
Under mild assumptions, $\vec{\theta}^{(t)}$ converges in distribution to a draw from the true posterior density $p(\vec{\theta}|y)$.
The Metropolis algorithm is flexible in the selection of the candidate density $q$ but may be less efficient than the Gibbs sampler when not properly tuned. The typical approach is to set the candidate density as \begin{equation} q(\vec{\theta}^*|\vec{\theta}^{(t-1)}) = N(\vec{\theta}^*|\vec{\theta}^{(t-1)}, \tilde{\Sigma}) \end{equation} This distribution is symmetric and “self-correcting”, as candidates are always centred around the current value of the chain. As such, this approach is also called random walk Metropolis.
Here $\tilde{\Sigma}$ represents the posterior (co)variance, which can be empirically estimated from a preliminary run. A candidate density with a small variance will increase the acceptance rate but also generate more autocorrelated samples and, therefore, may explore only a small proportion of the parameter space. A rule of thumb is to choose $\tilde{\Sigma}$ so that around $50\%$ of the candidates are accepted. In practice, however, the choice of $\tilde{\Sigma}$ is often made adaptively: one keeps track of the proportion of accepted candidates and tunes $\tilde{\Sigma}$ accordingly. This is usually done during the burn-in period and is called pilot adaptation. A sketch of random walk Metropolis on our allele frequency example is given below.
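A minimal sketch of random walk Metropolis, using the unnormalised allele frequency posterior from earlier as the target; the proposal standard deviation of 0.05 is an arbitrary choice you would normally tune:
In [ ]:
# random walk Metropolis on h(theta) proportional to theta^k (1-theta)^(n-k)
k <- 20; n <- 100
logh <- function(theta) {
	if (theta <= 0 || theta >= 1) return(-Inf)  # outside the support
	k*log(theta) + (n-k)*log(1-theta)
}
niter <- 1e4
chain <- numeric(niter)
chain[1] <- 0.5  # starting value
for (t in 2:niter) {
	cand <- rnorm(1, mean=chain[t-1], sd=0.05)  # symmetric Normal proposal
	logr <- logh(cand) - logh(chain[t-1])       # log acceptance ratio
	chain[t] <- if (log(runif(1)) < logr) cand else chain[t-1]
}
hist(chain[-(1:1000)], breaks=50)  # compare with the exact Beta(k+1, n-k+1)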
A generalisation of the Metropolis algorithm drops the requirement that the candidate density must be symmetric. For instance, for bounded parameter spaces (e.g. $\theta>0$) a Gaussian density is not appropriate. In the Metropolis-Hastings algorithm, when $q(\vec{\theta}^*|\vec{\theta}^{(t-1)}) \neq q(\vec{\theta}^{(t-1)}|\vec{\theta}^*)$ we replace the acceptance ratio $r$ by \begin{equation} r = \frac{ h(\vec{\theta}^*) q(\vec{\theta}^{(t-1)}|\vec{\theta}^*) }{ h(\vec{\theta}^{(t-1)}) q(\vec{\theta}^*|\vec{\theta}^{(t-1)}) } \end{equation} Again, a draw $\vec{\theta}^{(t)}$ converges in distribution to a draw from the true posterior density as $t \rightarrow \infty$.
An alternative approach is to set $q(\vec{\theta}^*|\vec{\theta}^{(t-1)})=q(\vec{\theta}^*)$, where the proposal ignores the current value of the variable. This algorithm is called the Hastings independence chain. In this case, the acceptance ratio becomes \begin{equation} r = \frac{h(\vec{\theta}^*)/q(\vec{\theta}^*)}{h(\vec{\theta}^{(t-1)})/q(\vec{\theta}^{(t-1)})} \end{equation} which is the weight function in importance sampling.
There are many variants and types of MCMC algorithms. In the Langevin-Hastings algorithm we introduce a systematic drift in the candidate density. Another approach is to use auxiliary variables to expand the parameter space, as in the slice sampler algorithm. An appropriate enlargement of the parameter space can broaden the class of distributions we can sample and accelerate convergence. A nice feature of MCMC algorithms is that they can be combined into a single problem using hybrid forms, resulting in a mixture of algorithms. Finally, adaptive algorithms attempt to use the early output from a chain to refine and improve the sampling as it progresses.
Convergence¶
Convergence is an important issue for MCMC algorithms, as their output is random and autocorrelated. When the output is safely thought to come from the true stationary distribution of the Markov chain for all $t>T$, the MCMC algorithm is said to have converged at time $T$. There is both a theoretical basis and a set of diagnostic tools to assess whether the chain has indeed reached convergence.
A possible diagnostic strategy is to:
• run a few parallel chains with starting points drawn from a distribution that is overdispersed (wide) with respect to the stationary distribution;
• visually inspect these chains on a common graph for each parameter (as shown in the figure below);
• for each graph calculate the scale reduction factor (to check whether the variation within chains is approximately equal to the total variation);
• investigate cross-correlations among parameters suspected of being confounded. A sketch of these diagnostics with the coda package follows.
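A hedged sketch of these diagnostics using the coda package; runChain is a hypothetical helper wrapping the random walk Metropolis sampler from above, started from two overdispersed points:
In [ ]:
library(coda)
runChain <- function(start, niter=1e4, k=20, n=100) {
	logh <- function(x) if (x <= 0 || x >= 1) -Inf else k*log(x)+(n-k)*log(1-x)
	chain <- numeric(niter); chain[1] <- start
	for (t in 2:niter) {
		cand <- rnorm(1, chain[t-1], 0.05)
		chain[t] <- if (log(runif(1)) < logh(cand)-logh(chain[t-1])) cand else chain[t-1]
	}
	mcmc(chain)
}
chains <- mcmc.list(runChain(0.01), runChain(0.99))
plot(chains)          # trace plots of the chains on a common graph
gelman.diag(chains)   # potential scale reduction factor (should be close to 1)
autocorr.diag(chains) # autocorrelation within chains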
Software¶
There are several software packages that implement MCMC algorithms for generating samples from posterior distributions. A commonly used program is OpenBUGS, the free and open-source version of WinBUGS. The BRugs package in R calls OpenBUGS. Finally, JAGS and its R interface rjags are valid alternatives.

Approximate Bayesian Computation¶
Bayesian computation involves modelling the joint density of parameter values $\theta$ and data $x$. The aim is then to compute the posterior probability \begin{equation} P(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{m(x)} \end{equation} which can be difficult, as the marginal likelihood \begin{equation} m(x) = \int f(x|\theta)\pi(\theta)d\theta \end{equation} can involve a high-dimensional integral. If the likelihood can be evaluated up to a normalising constant, Monte Carlo methods can be used to sample from the posterior.
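As a quick numerical illustration, for the one-dimensional Beta-Binomial model the marginal likelihood can be evaluated by quadrature and checked against its closed form:
In [ ]:
# evaluating the marginal likelihood m(x) numerically (Beta-Binomial model)
k <- 20; n <- 100
integrand <- function(p) dbinom(k, size=n, prob=p) * dbeta(p, 1, 1)
integrate(integrand, lower=0, upper=1)$value  # numerical m(x)
choose(n, k) * beta(k+1, n-k+1)               # exact m(x) with a flat prior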
As the models become more complicated, the likelihood function becomes difficult to define and compute. Under these circumstances it is easier to simulate data samples from the model given the value of a parameter. If the data are discrete and of low dimensionality, then it is possible to sample from the posterior density of the parameter without an explicit likelihood function and without approximation.
Rejection algorithm¶
In the discrete, low-dimensionality case, the algorithm to sample from the posterior via simulations is as follows. Given observation $y$, repeat the following until $N$ points have been accepted:
• Draw $\theta_i \sim \pi(\theta)$
• Simulate $x_i \sim f(x|\theta_i)$
• Reject $\theta_i$ if $x_i \neq y$, otherwise accept it.
The accepted values are samples from $P(\theta|y)$.
The posterior distribution gives the probability distribution of the parameter values that gave rise to the observations. To calculate summaries of this distribution it is possible to draw a histogram and derive notable percentiles.
Let’s use again the example of elephant herds. Suppose we observe $4$ herds arriving, so $y=4$. Recalling how we modelled this process, the likelihood function was Poisson-distributed with a Gamma-shaped prior $G(3,1)$. The posterior distribution was Gamma distributed with shape parameter $3+4=7$ and scale $0.5$. Let’s assume that we don’t know the posterior distribution as we cannot derive it. However, we assume that we know how to simulate $y$ given a certain value of our parameter $\theta$, the average arrival of herds per day.
Let’s write some R code to sample from the posterior in a likelihood-free way. You need to source some R functions in the Data folder to perform simulations first.
In [1]:
# load some already prepared functions
source("Data/abc.R")

# Rejection algorithm
N <- 1e4
y <- 4
# function to simulate is called "simulatePoisson"
thetas <- c()
while (length(thetas) < N) {
	# 1. draw from prior (continuous, bounded, uniform)
	theta <- runif(1, 0, 20)
	# 2. simulate observations
	ysim <- simulatePoisson(theta)
	# 3. accept/reject
	if (ysim == y) thetas <- c(thetas, theta)
}
hist(thetas)
quantile(thetas, c(0.025, 0.25, 0.5, 0.75, 0.975))
2.5%   1.61607461050153
25%    3.38704109191895
50%    4.68384738080204
75%    6.2685907445848
97.5%  10.1489826198667
Here we assume that we don't know the likelihood function but we can simulate data under this unknown function.
What happens in the continuous case? If the data are of low dimensionality, we can modify the previous algorithm as follows:
• Draw $\theta_i \sim \pi(\theta)$
• Simulate $x_i \sim p(x|\theta_i)$
• Reject $\theta_i$ if $\rho(x_i,y) > \epsilon$
where $\rho(\cdot)$ is a function measuring the distance between simulated and observed points.
Let’s make a further example, recalling the case of water temperature at Bumpass Hell. $\theta$ is continuous and according to our previous example it has a prior distribution $U(80,110)$. We assume we don’t know the likelihood function (which is normally distributed) but we can simulate observations that are distributed according to it. Finally, we assume we have an observation of the temperature $y=91.3514$.
What is $\rho(\cdot)$? For instance we can use the Euclidean distance \begin{equation} \rho(x_i, y) = \sqrt[]{(x_i-y)^2} \end{equation} Let’s write some R code to estimate $\theta$ using this algorithm.
In [6]:
# Rejection algorithm in the continuous case
N <- 1e3
y <- 91.3514
epsilon <- 1e-1
# function to simulate is called "simulateNormal"
# euclidean distance
rho <- function(x,y) sqrt((x-y)^2)
thetas <- c()
while (length(thetas) < N) {
	# 1. draw from prior (continuous, bounded, uniform)
	theta <- runif(1, min=50, max=150)
	# 2. simulate observations
	ysim <- simulateNormal(theta)
	# 3. accept/reject
	if (rho(ysim,y) <= epsilon) thetas <- c(thetas, theta)
}
print(mean(thetas))
hist(thetas)
quantile(thetas, c(0.025,0.25,0.5,0.75,0.975))
[1] 91.26865
2.5%   84.6583034843206
25%    89.2081973841414
50%    91.4858682546765
75%    93.3908108621836
97.5%  97.5261773215607
You can appreciate that the more the prior differs from the unknown likelihood function, the lower the acceptance rate.
An alternative to choosing a fixed value for $\epsilon$ is to rank all distances and select only a proportion of the lowest ones. In this case one sets the total number of simulations to be performed (not the number of accepted simulations) and the proportion of simulations to retain. It is convenient to inspect the distribution of ranked distances, to make sure the retained simulations are genuine outliers in the lower tail of that distribution.
TASK Recalling the previous example, write some R code to estimate $\theta$ assuming that
• our observations are $Y=\{91.34, 89.21, 88.98\}$,
• $\theta$ has prior $N(\mu=90,\sigma^2=20)$ defined only in $80 \leq \theta \leq 110$,
• the simulating function is called "simulateNormal" in "Data/abc.R",
• the distance function is $\rho(x_i, Y)=\frac{\sum_{j \in Y} \sqrt{(x_i-j)^2}}{|Y|}$,
• the total number of simulations is $10,000$ and we want to accept the lowest $5\%$ of distances.
Complete the following tasks:
• plot the sampled prior distribution,
• plot the distribution of ranked distances with an indication of the $5\%$ threshold,
• plot the posterior distribution,
• calculate notable quantiles and the $95\%$ HPD (using the library coda and the function HPDinterval(as.mcmc(x), prob=0.95)).
Finally, what happens if we have more observations?
In [ ]:
# Rejection algorithm with proportions of simulations to accept
# ...
When the data become high dimensional (e.g. multivariate measurements), it is necessary to reduce the dimensionality via the use of summary statistics. For instance, the complete genome of many samples has high dimensionality, as it may have up to $N*L$ genotypes with $N$ samples and $L$ sites per genome. One can calculate summary statistics $S(y)$ that describe some features of the data (e.g. indexes of genetic diversity in the case of multiple genomes). In this case, the following is the prototype of the rejection-ABC algorithm, where ABC stands for Approximate Bayesian Computation. Given observation $y$, repeat the following until $N$ points have been accepted:
• Draw $\theta_i \sim \pi(\theta)$
• Simulate $x_i \sim p(x|\theta_i)$
• Reject $\theta_i$ if $\rho(S(x_i),S(y))>\epsilon$
A toy illustration of this prototype is sketched below.
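Below is a hedged toy sketch of this prototype: hypothetical observed data from a Normal distribution, with the sample mean and standard deviation as summary statistics and a rank-based tolerance:
In [ ]:
# rejection-ABC with summary statistics on toy data
set.seed(1)
y <- rnorm(50, mean=5, sd=2)
Sobs <- c(mean(y), sd(y))  # observed summary statistics
M <- 1e5
mu <- runif(M, 0, 10); sigma <- runif(M, 0.1, 5)  # draws from the priors
Ssim <- t(sapply(1:M, function(i) { x <- rnorm(50, mu[i], sigma[i]); c(mean(x), sd(x)) }))
# scaled Euclidean distance between simulated and observed statistics
d <- sqrt( ((Ssim[,1]-Sobs[1])/sd(Ssim[,1]))^2 + ((Ssim[,2]-Sobs[2])/sd(Ssim[,2]))^2 )
keep <- d <= quantile(d, 0.05)  # retain the lowest 5% of distances
hist(mu[keep])
quantile(mu[keep], c(0.025, 0.5, 0.975))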
The function $S(\cdot)$ can be a vector. The choice of summary statistics is considered as mapping from a high dimension to a low dimension. Some information is lost, but with enough summary statistics much of the information is kept. The aim for the summary statistics is to satisfy the Bayes sufficiency \begin{equation} P(\theta|x)=P(\theta|S(x)) \end{equation}
The first example of an ABC approach was introduced by Pritchard et al. (1999). They summarised information for 445 Y-chromosome gene copies at eight microsatellites (and therefore 445 times 8 dimensions) into three numbers. The distance was chosen to be a normalised Chebyshev distance \begin{equation} \max_j \left| \frac{S_j(x)}{S_j(y)} - 1 \right| \end{equation} for $j=1,...,s$ summary statistics.
You can clearly see one of the first issues in ABC, related to the curse of dimensionality when using more than a few summary statistics. If the summary statistics are uncorrelated, then using the above distance we will reject more and more simulations as the number of summary statistics increases. Solutions have been proposed to (i) use a wider acceptance tolerance and/or (ii) perform a better sampling from the prior.
Regression-based estimation¶
Another possibility to derive the posterior using ABC is based on local linear regression, in order to obtain a potentially wider set of accepted points. The algorithm is the following (a sketch of the regression adjustment follows the list):
• Given observation $y$ repeat the following until $M$ points have been generated:
▪ Draw $\theta_i \sim \pi(\theta)$
▪ Simulate $x_i \sim f(x|\theta_i)$
• Calculate $S_j(x)$ for all $j$ and $k_j$, the empirical standard deviation of $S_j(x)$
• Calculate $\rho(S(x),S(y))=\sqrt{\sum_{j=1}^s \left( \frac{S_j(x)}{k_j} - \frac{S_j(y)}{k_j} \right)^2 }$
• Choose a tolerance $\epsilon$ such that the proportion of accepted points is $P_\epsilon=\frac{N}{M}$
• Weight the simulated points $S(x_i)$ using $K_\epsilon(\rho(S(x_i),S(y)))$, where \begin{equation} K_\epsilon(t) = \begin{cases} \epsilon^{-1}(1-(t/\epsilon)^2) & \text{for } t \leq \epsilon \\ 0 & \text{for } t > \epsilon \end{cases} \end{equation}
• Apply weighted linear regression to the $N$ points that have nonzero weight to obtain an estimate of $\hat{E}(\theta|S(x))$
• Adjust $\theta_i^*=\theta_i-\hat{E}(\theta|S(x_i))+\hat{E}(\theta|S(y))$
• The $\theta_i^*$ with weights $K_\epsilon(\rho(S(x_i),S(y)))$ are random draws from an approximation of the posterior distribution $P(\theta|y)$.
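A hedged sketch of the adjustment step, assuming a vector theta of simulated parameters, a matrix S of simulated summary statistics, and a vector Sobs of observed statistics from a previous round (all names hypothetical; the kernel constant $\epsilon^{-1}$ is omitted as it cancels in the weighting):
In [ ]:
adjustRegression <- function(theta, S, Sobs, tol=0.05) {
	k <- apply(S, 2, sd)                     # empirical standard deviations of the statistics
	d <- sqrt(colSums(((t(S)-Sobs)/k)^2))    # scaled Euclidean distances
	eps <- quantile(d, tol)                  # tolerance from the acceptance proportion
	keep <- d <= eps
	w <- 1 - (d[keep]/eps)^2                 # Epanechnikov-type weights
	fit <- lm(theta[keep] ~ S[keep,,drop=FALSE], weights=w)  # weighted local-linear regression
	pred.sim <- fitted(fit)                  # estimate of E(theta|S(x_i))
	pred.obs <- sum(coef(fit) * c(1, Sobs))  # estimate of E(theta|S(y))
	list(theta.adj = theta[keep] - pred.sim + pred.obs, weights = w)
}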
There are problems with regression-based methods too. When the observed summary statistics lie outside the unknown likelihood distribution (model misspecification), the regression is an extrapolation rather than an interpolation. In these cases, posterior draws (after regression adjustment) can fall outside the prior range. This problem occurs when the observations lie at the boundaries of the unknown likelihood (called the prior-predictive distribution in the ABC context).
MCMC-ABC¶
Another possibility to increase the performance of ABC estimation is to do a better sampling. Indeed, the great majority of simulated parameter values may not give rise to summary statistics that are similar enough to the observed data. Efficiency will be low, as many points will be rejected or given negligible weight. We therefore want a procedure whereby parameters are sampled from a distribution that is closer to the posterior than the prior is. There are two main ways to do this: via Markov chain Monte Carlo (MCMC) or via Sequential Monte Carlo (SMC) sampling.
An MCMC-ABC algorithm starts by sampling an initial value $\theta^{(0)} \sim \pi(\theta)$. Then at iteration $t \geq 1$:
• Simulate $\theta’ \sim K(\theta|\theta^{(t-1)})$ where $K(\cdot)$ is a proposal distribution that depends on the current value of $\theta$
• Simulate $x \sim p(x|\theta')$.
• If $\rho(S(x),S(y))<\epsilon$ (rejection step):
	▪ draw $u \sim U(0,1)$,
	▪ if $u \leq \pi(\theta')/\pi(\theta^{(t-1)}) \times K(\theta^{(t-1)}|\theta')/K(\theta'|\theta^{(t-1)})$, update $\theta^{(t)}=\theta'$; otherwise set $\theta^{(t)}=\theta^{(t-1)}$;
• otherwise set $\theta^{(t)}=\theta^{(t-1)}$.
A good proposal distribution should resemble the actual posterior distribution of the parameters. A Normal proposal distribution, centred on $\theta^{(t-1)}$, often works well in practice. This is also called the jumping distribution. In this algorithm, at convergence the distribution of proposed values $\theta'$ is dominated by the posterior itself. It is also possible to apply any regression-adjustment method to the MCMC sample to obtain more accurate estimates. Compared to classic MCMC with likelihoods, this algorithm has a higher rejection rate. To circumvent this problem, the tolerance $\epsilon$ can be set high initially and then reduced during the burn-in phase.
A method called Sequential Monte Carlo (SMC) was later proposed for iteratively improving on an ABC approximation. This approach consists of two main features: (i) weighted resampling from the set of points already drawn and (ii) successive reduction of the tolerance $\epsilon$.
Model assessment in ABC¶
Classical Bayesian inference on models and parameters can be applied within the ABC framework.
Model choice¶
Given a series of models $\mu_1, \mu_2, ..., \mu_N$ with prior probabilities $\sum_i \pi(\mu_i)=1$, we can calculate Bayes factors between two models $i$ and $j$ as \begin{equation} \frac{p(\mu_i|x)}{p(\mu_j|x)} \div \frac{p(\mu_i)}{p(\mu_j)}
\end{equation} Typically, Bayes factors can be computed only if the parameters within the models have priors that integrate to one. Therefore, Bayesian model choice can be strongly affected by the prior. Notably, Bayesian model choice automatically penalises models with many parameters, so one does not need to explicitly account for different numbers of parameters between models.
Hierarchical model¶
ABC can also be adopted in a hierarchical Bayesian model. A potential difficulty here is that the summary statistics should capture information from each unit, so that the hyperparameters can be well inferred. However, if there are many summary statistics, it is unlikely that simulated data will closely match the observations.
Choice of summary statistics¶
A rather arbitrary area of ABC modelling lies in the choice of summary statistics. In some fields there is a history of relating summary statistics to model parameters; in general, however, there is no strong theory linking the two. One issue here is the effect of summary statistics on inferences, and whether some choices may bias the outcome of model choice. This may happen if the chosen summary statistics have little relation to the parameters of other models. Typically, some summary statistics may cover certain aspects of the model while other statistics cover different aspects, making the choice of a finite set of informative units problematic. The main idea is that, as more summary statistics are used, they should become jointly sufficient for the likelihood. Also, summary statistics may be correlated with each other and with the parameters. However, the accuracy and stability of ABC decreases rapidly with increasing numbers of summary statistics.
How can we choose the optimal set of summary statistics? For instance, one could calculate the ratio of posterior densities with and without a particular summary statistic: departures greater than a threshold suggest that the excluded summary statistic is important. Alternatively, different summary statistics can be weighted differently according to their correlation with some model parameters. The number of summary statistics can also be reduced via multivariate dimensional scaling (e.g. using partial least-squares or principal component analysis). Finally, summary statistics should be scaled to have equal mean and variance (if normally distributed), to avoid putting different weights on sparser distributions.
Model validation¶
A very important component of Bayesian modelling is validation and testing. Validation is the assessment of the goodness-of-fit of the model and the comparison of alternative models. In ABC it is essential to distinguish errors due to the approximation from errors caused by the choice of the model. If available, one can compare the results of a simulation with expectations based on theory. Often the marginal (or joint) distribution of simulated summary statistics is visualised and compared to the corresponding target statistic: if the target falls outside, this could indicate a problem in the model. This issue does not occur in likelihood-based approaches. A similar test is to compare the observations with the posterior predictive distribution. This can be done by simulating data with parameters drawn randomly from the current posterior distribution.
Applications of ABC in ecology and evolution¶
The initial applications of ABC have been mainly in population genetics, using a rejection or regression algorithm.
Later on, a number of other areas in ecology, epidemiology and systems biology have seen an increase in the use of ABC. The more recent applications use MCMC or SMC algorithms.
In population genetics, the data consist of frequencies of alleles or haplotypes in one or more populations. The goal is usually to estimate the demographic history of populations in terms of changes in population size, divergence times, migration rates, and so on. A number of studies on inferring human evolution have used ABC methods. For instance, Patin et al. (2009) compared different demographic models to explain the genetic differentiation within different African populations.
Some features of ecology, epidemiology and systems biology appear to be very similar. Many aspects are captured by systems of partial or ordinary differential equations, or stochastic differential equations. Data often consist of time series and/or spatial data. The goal here is to compare hypothesised models that could explain the observed patterns and to infer their parameters. Toni et al. (2009) provide an example using a Lotka-Volterra system on prey-predator dynamics from time series data on abundances.
ABC has also been used for agent-based models, protein interaction networks, speciation rates under a neutral ecological model, extinction rates from phylogenetic data, and epidemiology (e.g. transmission).
In conclusion, when a likelihood function is known and can be efficiently evaluated, there is no advantage in using ABC. When the likelihood function is known but difficult to evaluate in practice, ABC is a valid alternative. Many scenarios that evolutionary biologists or ecologists are interested in can be generated by simulations, making ABC very appealing. ABC can also be useful for initial exploratory analyses.
Practical: estimating divergence time in bears¶
Preparation¶
For this practical you need some R packages, namely coda, abc, grid, maps, spam, and fields. For plotting purposes you may also want to use ggplot2. You will also need the software ms to be installed. You can find the executable for linux in bin/ms. If it doesn't work, you can compile the source Software/ms.tar.gz by tar -xzvf ms.tar.gz; cd msdir; gcc -o ms ms.c streec.c rand1.c -lm. Finally, you need some data and R functions provided in Data. I suggest copying functions.R and polar.brown.sfs* into the workspace where you will run this practical, without overwriting the repository. If you encounter difficulties, please do let me know. You can work in teams for this practical; actually, I encourage you to team up for this exercise.
Project¶
In this practical you are going to estimate the divergence (or speciation) time between polar bears and brown bears using genomic data. You will use Approximate Bayesian Computation methods to infer this time.
In [ ]:
# Open R and load all R functions and data needed:
source("Data/functions.R")
load("Data/polar.brown.sfs.Rdata")
# Inspect the objects:
ls()
The file polar.brown.sfs includes the joint (2-dimensional) site frequency spectrum (SFS) between polar bears (on the rows) and brown bears (on the columns). This is based on real genomic data from 18 polar bears and 7 brown bears. The site frequency spectrum is a matrix $N \times M$ where cell $(i,j)$ reports the number of sites with allele frequency $(i-1)$ in polar bears and $(j-1)$ in brown bears. If you want to see this file, type cat polar.brown.sfs in your terminal.
In [ ]:
# You can plot this spectrum:
plot2DSFS(polar.brown.sfs, xlab="Polar", ylab="Brown", main="2D-SFS")
Each population has $2n+1$ entries in its spectrum, with $n$ being the number of individuals. The number of chromosomes for each species (bears are diploid, like humans) can be retrieved as:
In [ ]:
nChroms.polar <- nrow(polar.brown.sfs)-1
nChroms.polar
nChroms.brown <- ncol(polar.brown.sfs)-1
nChroms.brown
The only thing you need to remember about the site frequency spectrum is that we can easily calculate several summary statistics from it. These summary statistics can be used for inferences in an Approximate Bayesian Computation (ABC) framework.
For instance, from the site frequency spectrum we can easily calculate the number of analysed sites (in this example all sites are polymorphic, and thus variable in our sample), which is simply the sum of all entries in the SFS.
In [ ]:
nrSites <- sum(polar.brown.sfs, na.rm=T)
nrSites
This value is important, as we will condition the simulations on generating this number of sites for each repetition. In other words, when simulating data we will simulate exactly this number of polymorphic sites, then calculate the site frequency spectrum and all corresponding summary statistics.
I provide a function to easily calculate several summary statistics from a site frequency spectrum.
In [ ]:
obsSummaryStats <- calcSummaryStats(polar.brown.sfs)
obsSummaryStats # These are the OBSERVED summary statistics! Keep them.
These are the summary statistics available in this practical, and their meaning is the following:
• fst: population genetic differentiation; it measures how genetically different the two species are; it goes from 0 (identical) to 1 (completely different);
• pivar1: genetic diversity of species 1 (polar bears);
• pivar2: genetic diversity of species 2 (brown bears);
• sing1: number of singletons (sites with frequency equal to 1) for species 1 (polar bears);
• sing2: number of singletons (sites with frequency equal to 1) for species 2 (brown bears);
• doub1: number of doubletons (sites with frequency equal to 2) for species 1 (polar bears);
• doub2: number of doubletons (sites with frequency equal to 2) for species 2 (brown bears);
• pef: proportion of sites with equal frequency between polar bears and brown bears;
• puf: proportion of sites with unequal frequency between polar bears and brown bears (note that puf=1-pef).
It is not important that you understand the significance (if any) of all these summary statistics in an evolutionary context. If interested, a nice review is "Molecular Signatures of Natural Selection" by Rasmus Nielsen. However, some of these summary statistics might be more informative than others. Your first goal is to understand which summary statistics to keep.
The parameter you want to estimate is the divergence time between polar and brown bears (T). Your first aim is to perform N simulations by drawing values from a prior distribution of T, and to record (separately) the drawn values and the corresponding summary statistics generated by each value of T. You can define how many simulations you want to perform (ideally a lot).
In [ ]:
nrSimul <- 1e4 # but change this accordingly
Then you should define the prior distribution for the parameter to be estimated, the divergence time T. You can use any distribution you find suitable. However, you may want to consider that a reasonable range of values for T is between 200k and 700k years ago.
The function to simulate data (specifically the site frequency spectrum) given values of T (and M, the migration rate) is simulate:
In [ ]:
simulate
This function takes as parameters: T (divergence time), M (migration rate), how many sites to simulate, the directory of the ms program, and the output text file. It simulates a joint evolutionary history for both polar and brown bears according to what we know of their respective changes in size. However, you can set when they speciated (T, in years ago) and the migration rate (M). (Note that the migration rate is scaled by the reference population size, so a reasonable range for M is between 0 and 2.)
As an example, assuming T=200k and M=0, the commands to simulate data and calculate summary statistics are the following:
In [ ]:
# first, set the path to the "ms" software you installed
msDir <- "bin/ms" # this is my specific case, yours could be different
# second, set the name for the output text file
fout <- "ms.txt" # leave it like here
# then we can simulate data:
simulate(T=2e5, M=0, nrSites, msDir, fout)
# and finally calculate the summary statistics for this simulation
# (note that you need to specify the number of chromosomes for the two species)
simulatedSFS <- fromMStoSFS(fout, nrSites, nChroms.polar, nChroms.brown)
calcSummaryStats(simulatedSFS)
# you can even plot the simulated site frequency spectrum
plot2DSFS(simulatedSFS, xlab="Polar", ylab="Brown", main="simulated 2D-SFS")
Based on the observed summary statistic 'fst' (which measures how different polar and brown bears are), in relation to the one calculated by simulating with T=2e5, can you make some initial (very rough) considerations on the most likely values of T (higher or lower than 200k years ago)?
You can use the abc package and the abc function to calculate the posterior distribution (as well as to compute the distance between observed and expected summary statistics).
In [ ]:
# install.packages("abc")
library(abc)
?abc
As you can see, to perform an ABC analysis you need 3 objects:
• target: a vector of the observed summary statistics;
• param: a vector, matrix or data frame of the simulated parameter values;
• sumstat: a vector, matrix or data frame of the simulated summary statistics.
You already have 'target', as it is the vector of observed summary statistics called 'obsSummaryStats'.
You now have everything you need to estimate the divergence time. For simplicity, assume that $M=0$. Also, you are free to choose a rejection or local-regression method, as specified in the 'abc' function.
This is not strictly required, but if you want to explore the estimation of two parameters simultaneously, you can estimate M by defining a prior for it, drawing random samples of T and M jointly, calculating summary statistics, and so on.
Hints¶
Please consider these points carefully when completing the project.
• Assess which summary statistics are more or less informative for the parameter estimation (e.g. after a first run of simulations with all parameters, look for correlations between the simulated parameter values and the summary statistics).
• You can also look for correlations between summary statistics, and use only one of a pair if two summary statistics are highly correlated. If you are a pro, you can also perform a principal component or multidimensional scaling analysis (e.g.
with package 'pls') and, using each statistic's loadings, create novel uncorrelated summary statistics which are linear combinations of the previous ones (this part is purely suggestive and is not required to obtain the full score).
• Remember to scale your simulated summary statistics (jointly with the observed ones) separately for each statistic, so that the mean is zero and the standard deviation is one.
• Generate a plot with the posterior distribution of the parameter of interest. You can also show the chosen prior distribution on the same plot.
• Calculate the posterior mean, mode, median and other notable quantities (e.g. the 95\% HPD interval) to summarise the posterior distribution.
• I suggest you use the 'abc' package in R, as it also implements the local-linear regression method. However, you can also implement a rejection sampling method yourself, as seen in class.
• A useful diagnostic plot to show is the distribution of sampled values from the prior: do they cover the whole range of the prior (and are they distributed as expected)? A hedged skeleton of the overall workflow is sketched below.
In [ ]:
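To get you started, here is a hedged skeleton of the workflow using the functions defined above; the prior choice, the tolerance, and the scaling step are placeholders you should revisit and justify yourselves:
In [ ]:
# skeleton of the ABC analysis (assumes simulate, fromMStoSFS, calcSummaryStats,
# nrSites, msDir, fout, nChroms.polar, nChroms.brown and obsSummaryStats exist)
nrSimul <- 1e4
T.sim <- runif(nrSimul, 2e5, 7e5)  # e.g. a uniform prior on T; your choice may differ
sumstats <- matrix(NA, nrow=nrSimul, ncol=length(obsSummaryStats))
for (i in 1:nrSimul) {
	simulate(T=T.sim[i], M=0, nrSites, msDir, fout)
	sfs <- fromMStoSFS(fout, nrSites, nChroms.polar, nChroms.brown)
	sumstats[i,] <- calcSummaryStats(sfs)
}
# remember to select informative statistics and to scale sumstats
# (jointly with obsSummaryStats) before this step
library(abc)
est <- abc(target=obsSummaryStats, param=T.sim, sumstat=sumstats, tol=0.05, method="loclinear")
summary(est)
hist(est$adj.values)  # posterior distribution of T after regression adjustment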