Bayesian methods in ecology and evolution¶
https://bitbucket.org/mfumagal/statistical_inference
Intended Learning Outcomes¶
At the end of this module you will be able to:
• critically discuss advantages (and disadvantages) of Bayesian data analysis,
• illustrate Bayes’ Theorem and concepts of prior and posterior distributions,
• implement simple Bayesian methods, including sampling and approximate techniques and Bayesian networks,
• apply Bayesian methods to solve problems in ecology and evolution.

day 1a: Bayesian thinking¶
Intended Learning Outcomes¶
At the end of this part you will be able to:
• appreciate the use of Bayesian statistics in life sciences,
• formulate and explain Bayes’ theorem,
• describe a Normal-Normal model and implement it in R with or without Monte Carlo sampling,
• apply Bayesian statistics in genomics.
the eyes and the brain¶
“You know, guys? I have just seen the Loch Ness monster at Silwood Park! Can you believe that?”

What does this information tell you about the existence of Nessie?
In the classic frequentist, or likelihoodist, approach you make some inferences based on all the data that you have observed. The only data that you observe here is me telling you whether or not I saw Nessie. In other words, your inference on whether Nessie exists (at Silwood!) or not will be solely based on such observations.
ACTIVITY
Let’s denote $D$ (data) as the set of observations specifying whether I saw Nessie ($D=1$) or not ($D=0$). $D$ is our sample space, the set of all possible outcomes of the experiment, and $D = \{0,1\}$. We want to make some inferences on the probability that Nessie exists. Let’s denote this random variable as $N$ and assume for simplicity that it can take only values $0$ and $1$.
We can define a likelihood function $p(D|N)$.
For instance, we could set $p(D=1|N=0)=0.05$ (the chance that Nessie does not exist but I tell you that I saw it) and $p(D=1|N=1)=0.99$ (the chance that Nessie exists and I tell you that I saw it). Likewise, let's assume that $p(D=0|N=0)=0.95$ and $p(D=0|N=1)=0.01$.
Let's make it slightly more complicated and imagine that a second lecturer tells you that she/he did not see Nessie, while a third lecturer tells you that she/he did see it. Let's assume that the likelihood function is the same for each observer/lecturer $l$.
The likelihood function is defined by these conditional probabilities:
| N   | D=0  | D=1  |
|-----|------|------|
| N=0 | 0.95 | 0.05 |
| N=1 | 0.01 | 0.99 |

The observations are $D=[1,0,1]$.
What is the log-likelihood distribution of $N$? What is the Maximum Likelihood Estimate (MLE) of $N$? What is the likelihood ratio in favour of the hypothesis of Nessie existing?
In this very trivial example we maximise the likelihood function for $N$ and obtain $N_{MLE}=1$. The difference in likelihood between $N=1$ and $N=0$ gives us some sort of confidence level. The more data we have pointing towards one “direction”, the more confident we are of our inferences.
Recalling our previous example, if all 3 observations were $D=1$ we would obtain a log-likelihood ratio (LR) of $8.96$. With 100 observations of $D=1$ we would obtain a log-LR of $\approx 300$. On the other hand, with only 1 observation the log-LR would be $<3$.
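As a sanity check, here is a minimal R sketch of this calculation; the function name `loglik` is just illustrative. It evaluates the log-likelihood of $N=0$ and $N=1$ under the table above, reports the MLE, and reproduces the log-likelihood ratios quoted in the text.

```r
# Log-likelihood of N given independent reports, using the table above:
# p(D=1 | N=0) = 0.05 and p(D=1 | N=1) = 0.99
loglik <- function(D, N) {
  p1 <- ifelse(N == 1, 0.99, 0.05)      # probability that a lecturer reports a sighting
  sum(log(ifelse(D == 1, p1, 1 - p1)))  # sum over independent observations
}

D <- c(1, 0, 1)                                   # the three lecturers' reports
ll <- c("N=0" = loglik(D, 0), "N=1" = loglik(D, 1))
ll                                                # log-likelihood of each hypothesis
names(which.max(ll))                              # MLE of N: "N=1"
ll["N=1"] - ll["N=0"]                             # log-LR for D = [1, 0, 1]: ~1.4
loglik(rep(1, 3), 1) - loglik(rep(1, 3), 0)       # all three report a sighting: ~8.96
loglik(rep(1, 100), 1) - loglik(rep(1, 100), 0)   # 100 sightings reported: ~300
```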
We can appreciate how our inference on $N$ is driven solely by our observations, through the likelihood function. In a very informal (and possibly wrong) notation, we can write $p(N|D) \propto p(D|N)$, where we stress the conditioning of $N$ on the observed data $D$.
An analogy here can explain this concept further.
Imagine that in the likelihood approach we use only one visual (or auditory) organ, i.e. our eyes (or ears).

However, in real life we make many decisions based not solely on what we observe but also on some beliefs of ours.
We usually use another organ, the brain, to make inferences on the probability of a particular event occurring.

Note that in this cartoon the brain is "blind", in the sense that it does not observe the data (no arrow connects it to the eye), but its inferences on the event are based on its own beliefs.
Back to the Loch Ness monster case, we may well hold some belief about whether or not Nessie exists, not only because I told you that I saw it on campus. This "belief" expresses the probability of Nessie existing, $p(N)$, unconditional on the data. Our intuition is that the probability of $N=1$ is somehow a joint product of the likelihood (the eyes) and the belief (the brain). Therefore, $p(N|D) \propto p(D|N)p(N)$.
How can we define $p(N)$? This depends on our blind "belief" function. If you are a Sci-Fi fan you might be inclined to set a higher probability (e.g. $p(N=1)=0.2$) than the one a pragmatic and sceptical person would set (e.g. $p(N=1)=0.002$).
As an illustration, let's assume that $p(D=1|N=0)=0.001$ and $p(D=1|N=1)=0.1$. In the "Sci-Fi brain", $p(N=1|D=1) \propto p(D=1|N=1)p(N=1) = 0.1 \times 0.2 = 2 \times 10^{-2}$. In the "sceptical brain", $p(N=1|D=1) \propto 0.1 \times 0.002 = 2 \times 10^{-4}$.
Note that these are not proper probabilities (we will see later how to calculate proper probabilities using "belief" functions). We can see how the choice of a different "belief" function can lead us to different conclusions or confidence levels.
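To make the comparison concrete, here is a small R sketch of the two "brains"; the prior on $N=0$ is simply taken to be one minus the prior on $N=1$, and the final normalisation step (dividing by the sum over $N$) is a preview of the proper posterior calculation we formalise later.

```r
# Unnormalised posteriors for the "Sci-Fi" and "sceptical" brains
lik           <- c("N=0" = 0.001, "N=1" = 0.1)    # p(D=1 | N)
prior_scifi   <- c("N=0" = 0.800, "N=1" = 0.2)
prior_sceptic <- c("N=0" = 0.998, "N=1" = 0.002)

lik["N=1"] * prior_scifi["N=1"]      # 2e-02 (not a proper probability)
lik["N=1"] * prior_sceptic["N=1"]    # 2e-04 (not a proper probability)

# Dividing by the sum over N turns these into proper posterior probabilities:
(lik * prior_scifi)   / sum(lik * prior_scifi)     # p(N=1 | D=1) ~ 0.96
(lik * prior_sceptic) / sum(lik * prior_sceptic)   # p(N=1 | D=1) ~ 0.17
```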
In statistics, the "belief" function (e.g. $p(N)$) is called prior probability and the joint product of the likelihood ($p(D|N)$) and the prior ($p(N)$) is proportional to the posterior probability ($p(N|D)$).
The use of posterior probabilities for inferences is called Bayesian statistics.
Who, what and why?¶
Who¶
Bayesian statistics is named after Thomas Bayes (1701–1761), an English statistician, philosopher and Presbyterian minister.

Thomas Bayes never published his famous accomplishment in statistics (remember what they say: "publish or perish"... he perished).
His notes were edited and published after his death. He studied logic and theology and, at the age of 33, became a minister in a village in Kent. Only in his later years did Thomas Bayes take a deep interest in probability.
The general interpretation of statistical inference called Bayesian was in reality pioneered by Pierre-Simon Laplace. In fact, some argue that Thomas Bayes intended his results in a much more limited way than modern Bayesians would. In the special case Thomas Bayes presented, the prior and posterior distributions were Beta distributions and the data came from Bernoulli trials. Interestingly, early Bayesian inference was called "inverse probability", because it infers backwards from observations to parameters.

What¶
Bayesian statistics is an alternative to classical frequentist approaches, where maximum likelihood estimates (MLE) and hypothesis testing based on p-values are often used.
However, there is no definite division between frequentists and Bayesians as, in many modern applications, the approach taken is eclectic.
We now discuss further examples where a Bayesian approach seems a more appropriate strategy for statistical inference.
Imagine you have just submitted a manuscript for publication to a peer-reviewed journal. You want to assess its probability of being accepted and published.

This assessment may use, for instance, the information regarding the journal's acceptance rate, the quality of the study and its relevance to the journal's scope.
Your manuscript is accepted. What is the probability that your next manuscript will be accepted?
You had one success over one trial. Therefore, the probability is 100%. However, it seems clear that this estimate is somehow "wrong", as it is based on a very small sample size, and we know that the journal's acceptance rate is in any case smaller than 100%.
You can think of the journal's acceptance rate (e.g. 20%) as our prior information. You are then tempted to set a probability of acceptance smaller than 100%. By doing so you are behaving as a Bayesian statistician, as you are adjusting the direct estimate in light of prior information. Bayesian statistics has the ability to incorporate prior information into an analysis.
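As a hedged illustration, the R sketch below encodes the journal example as a conjugate Beta-Binomial model; the Beta(2, 8) prior is an assumption of this sketch (chosen so its mean matches a 20% acceptance rate), not a value given in the text.

```r
# Prior belief about the acceptance probability: Beta(a, b) with mean a / (a + b) = 0.2
a <- 2; b <- 8
successes <- 1; trials <- 1          # one manuscript submitted, one accepted

# Conjugate update: posterior is Beta(a + successes, b + failures)
a_post <- a + successes
b_post <- b + (trials - successes)

successes / trials                   # direct (maximum likelihood) estimate: 1, i.e. 100%
a_post / (a_post + b_post)           # posterior mean: ~0.27, pulled towards the prior
```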
Suppose you are conducting an experiment measuring biodiversity on particular rocky shores in Scotland. Specifically, you are recording the number of different species of algae at 4 different locations over 3 years. Unfortunately, something happened in 2016 at Location B and no data were reported.
| Year | Loc. A | Loc. B | Loc. C | Loc. D |
|------|--------|--------|--------|--------|
| 2015 | 45     | 54     | 47     | 52     |
| 2016 | 41     | n.a.   | 43     | 45     |
| 2017 | 32     | 38     | 37     | 35     |
What is a reasonable value for the missing entry? How about 100?
Perhaps 100 is too high since the numbers surrounding the entry may point towards a value of around 45. We could fit a model or take an average to impute the missing data.
Now assume that you have access to some data for Location B in 2016. Specifically, you have partial data: you could retrieve biodiversity levels for only a fifth $(1/5)$ of Location B in 2016. You extrapolate this partial value to obtain an estimate, which turns out to be 100. Are you now willing to impute the missing entry with 100, extrapolated from partial coverage, when before you considered this value far too high?
A more intuitive solution would be to take a sort of weighted average between this direct (but uncertain) measurement (100) and the indirect estimate (45, the average of the surrounding cells) that you used when no information was available.
Finally, imagine that you can retrieve biodiversity values for half $(1/2)$ of Location B in 2016. In that case you would want to give this observation more weight than in the previous case, where only a fifth of the area was available.
Bayesian statistics formalises such integration between direct and indirect information.
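As a sketch of what this integration can look like, the snippet below combines the direct and indirect estimates as a precision-weighted average (the kernel of the Normal-Normal model we will meet later); the standard deviations are illustrative assumptions, chosen only so that half coverage is more precise than a fifth.

```r
# Precision-weighted average of a direct and an indirect estimate
combine <- function(direct, sd_direct, indirect, sd_indirect) {
  w <- c(1 / sd_direct^2, 1 / sd_indirect^2)   # precisions act as weights
  sum(w * c(direct, indirect)) / sum(w)        # combined (posterior) mean
}

indirect <- 45   # average of the surrounding cells in the table above

combine(direct = 100, sd_direct = 20, indirect = indirect, sd_indirect = 5)  # 1/5 coverage: ~48, stays near 45
combine(direct = 100, sd_direct = 8,  indirect = indirect, sd_indirect = 5)  # 1/2 coverage: ~60, pulled towards 100
```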
• The frequentist approach is based on imagining repeated sampling from a particular model, which defines the probability of the observed data conditional on unknown parameters.
• The likelihoodist approach uses the same sampling model as the frequentist, but all inferences are based on the observed data only.
• The Bayesian approach requires a sampling model (the likelihood) and a prior distribution on all unknown parameters. The prior and the likelihood are used to compute the conditional distribution of the unknown parameters given the observed data.
• The Empirical Bayes (EB) approach allows the observed data to contribute to defining the prior distribution.
To put it in a different perspective, assuming $D$ is the data and $\theta$ is your unknown parameter, the frequentist approach conditions on parameters and integrates over the data, $p(D|\theta)$.
On the other hand, the Bayesian approach conditions on the data and integrates over the parameters, $p(\theta|D)$.
Therefore, in Bayesian statistics we derive proper probability distributions for our parameters of interest, rather than a point estimate. In other words, in Bayesian statistics a probability is assigned to a hypothesis, while under frequentist inference a hypothesis is tested without being assigned a probability of occurring.
Unlike likelihoodist inference, Bayesian inference can "accept" the null hypothesis rather than "fail to reject" it. Bayesian procedures can also impose parsimony in model choice and avoid further testing for multiple comparisons.
Why¶
Bayesian methods for data analysis are used because:
• of the increase in computing power over recent years;
• they have good frequentist properties;
• their answers are more easily interpretable by non-specialists;
• they are already implemented in software packages.
Bayesian statistics is used in many topics in life sciences, such as genetics (e.g. fine-mapping of disease-susceptibility genes), ecology (e.g. agent-based models), evolution (e.g. inference of phylogenetic trees), bioinformatics (e.g. base calling).