Probability theory¶
Probability theory is the foundation for all statistical inferences. Through the use of models of experiments, we are able to draw inferences about populations based on examining only a part of the whole.
In this first lecture, we outline the basic ideas of probability theory that are of direct importance for statistical inference. Just as statistics builds upon probability theory, probability theory itself builds upon set theory.
Set theory¶
If one of our main objectives in statistics is to draw conclusions about a population of objects after an experiment, then it is essential to identify its possible outcomes.
The set $S$ of all possible outcomes of a particular experiment is called the sample space for the experiment.
If the experiment consists of tossing a coin, then the sample space contains only two outcomes, heads and tails, and therefore: $S=\{H,T\}$. If, on the other hand, the experiment consists of observing the new GCSE scores of randomly selected pupils, the sample space would be the set of integers between 0 and 9, that is $S=\{0,1,2,…,8,9\}$. Finally, consider an experiment where the observation is the reaction time to a stimulus. In this case, the sample space consists of all positive numbers, that is $S=(0,\infty)$.
TASK Imagine that our experiment consists of observing the nucleotidic sequence of a particular gene of interest. What is the sample space? $S_G=\{A,C,G,T\}?$ Now suppose that we are interested in making inferences on the amino acid sequence of a protein. What is the sample space? $S_P=?$ Finally, let’s suppose that our observations consist of the divergence between orthologous genes of arbitrary length. In other words, for a large set of genes we calculate the relative difference in nucleotidic content between genes in different species. What is the sample space for such divergence? $S_D=?$
From these last examples, we see that sample spaces can be classified into two types according to the number of elements they can contain: sample spaces can be either countable or uncountable. If the sample space is finite, or its elements can be put into a 1-1 correspondence with a subset of the integers, the sample space is countable. Therefore, the coin toss and GCSE score experiments have countable sample spaces, whereas the reaction time experiment has an uncountable one.
The distinction between countable and uncountable sample spaces, although it may seem trivial, is of great importance as it dictates the way in which probabilities can be assigned. In practice, probabilistic methods associated with uncountable sample spaces are less cumbersome and can provide an approximation to the true countable situation.
Once the sample space has been defined, we can consider collections of possible outcomes of an experiment.
An event is any collection of possible outcomes of an experiment, that is, any subset of $S$, including $S$ itself.
Let $A$ be an event, a subset of $S$. We say that the event $A$ occurs if the outcome of the experiment is in the set $A$. We can compare and equate sets: \begin{equation} A \subset B \iff x \in A \implies x \in B \end{equation} \begin{equation} A = B \iff A \subset B \text{ and } B \subset A \end{equation}
Given any two events $A$ and $B$ we can have several elementary set operations.
• Union: the union of $A$ and $B$ is the set of elements that belong to either $A$ or $B$ or both: \begin{equation} A \cup B = \{x: x \in A \text{ or } x \in B\} \end{equation}
• Intersection: the intersection of $A$ and $B$ is the set of elements that belong to both $A$ and $B$: \begin{equation} A \cap B = \{x: x \in A \text{ and } x \in B\} \end{equation}
• Complementation: the complement of $A$ is the set of all elements that are not in $A$: \begin{equation} A^c = \{x: x \notin A\} \end{equation}
TASK Consider an experiment of selecting a nucleobase at random. The sample space is $S=\{A,C,G,T\}$ and some possible events are $A=\{G,C,A\}$ and $B=\{T,G\}$. What is the union of $A$ and $B$? What is the intersection of $A$ and $B$? What is the complement of $A$? What is the complement of the union of $A$ and $B$?
Using Venn diagrams is another useful way to visualise (but not prove) set operations. For very large sets it is convenient to write programs that will do the calculations. As an illustration, let’s solve the task above using R. There are several functions in R suitable for set operations, as outlined here.
In [ ]:
# define the sample space
S <- c("A","C","G","T")
# define A and B
A <- c("G","C","A")
B <- c("T","G")
# union
union(A,B)
# intersect
intersect(A,B)
# complement of A
setdiff(S,A)
# complement of A U B
print(setdiff(S,union(A,B)))
From these examples, we also found a special set, the empty set $\emptyset$. It is also worth noting that the elementary set operations can be combined. We can now state several useful properties of set operations.
• Commutativity \begin{equation} A \cup B = B \cup A \qquad A \cap B = B \cap A \end{equation}
• Associativity \begin{equation} A \cup (B \cup C) = (A \cup B) \cup C \qquad A \cap (B \cap C) = (A \cap B) \cap C \end{equation}
• Distributive laws \begin{equation} A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \qquad A \cup (B \cap C) = (A \cup B) \cap (A \cup C) \end{equation}
• De Morgan's laws \begin{equation} (A \cup B)^c = A^c \cap B^c \qquad (A \cap B)^c = A^c \cup B^c \end{equation}
TASK Assume we have a third set $C=\{C,T\}$ of nucleobases. Use R to show that the above properties hold.
In [ ]:
C <- c("C","T")
# commutativity
setequal(union(A,B), union(B,A))
#...
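For instance, one of De Morgan's laws can be checked in the same way; the remaining properties follow the same pattern (this is only a sketch, not the full solution to the task).
In [ ]:
# De Morgan: (A U B)^c = A^c n B^c
setequal(setdiff(S, union(A, B)), intersect(setdiff(S, A), setdiff(S, B)))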
The operations of union and intersection can be extended to infinite collections of sets. As an example, let $S=(0,1]$ and define $A_i = [(1/i),1]$. Then $\bigcup_{i=1}^\infty A_i = ?$ and $\bigcap_{i=1}^\infty A_i = ?$.
We finally discuss the idea of a partition of the sample space.
Two events $A$ and $B$ are disjoint (or mutually exclusive) if $A \cap B = \emptyset$. The events $A_1,A_2,...$ are pairwise disjoint if $A_i \cap A_j = \emptyset$ for all $i \neq j$.
Disjoint sets are sets with no points in common. For instance the collection $A_i = [i, i+1)$ with $i=0,1,2,...$ consists of pairwise disjoint sets.
If $A_1, A_2, ...$ are pairwise disjoint and $\bigcup_{i=1}^\infty A_i = S$, then the collection $A_1, A_2, ...$ forms a partition of $S$.
In statistical inference, partitions are very useful as they allow us to divide the sample space into small nonoverlapping pieces.
Basics of probability theory¶
As seen before, the realisation of an experiment is an outcome in the sample space. If the experiment is performed many times, different outcomes may occur each time or some outcomes may repeat (if the sample space is countable). It is natural to think of such a frequency of occurrence as a probability. However, here we will think of probabilities using a mathematical axiomatic approach, in the sense that probabilities are defined by a function satisfying certain axioms.
Axiomatic foundations¶
For each event $A$ in the sample space $S$ we associate a number between 0 and 1 that we call the probability of $A$, denoted as $P(A)$. Intuitively, for each $A \subset S$ we define $P(A)$ as the probability that $A$ occurs. However, we need to be more formal in defining this.
A collection of subsets of $S$ is called a sigma algebra (or Borel field), denoted by $\mathcal{B}$, if it satisfies the following properties:
1. $\emptyset \in \mathcal{B}$
2. if $A \in \mathcal{B}$, then $A^c \in \mathcal{B}$
3. if $A_1, A_2, ... \in \mathcal{B}$, then $\bigcup_{i=1}^\infty A_i \in \mathcal{B}$
We can have many different sigma algebras associated with sample space $S$.
TASK Is the collection of sets $\{\emptyset, S\}$ a sigma algebra associated with $S$? Why?
If $S$ is finite or countable, we can (and will) take $\mathcal{B}$ to be the collection of all subsets of $S$, including $S$ itself.
TASK If $S$ has $n$ elements, there are $2^n$ sets in $\mathcal{B}$. What is $\mathcal{B}$ if $S=\{1,2,3\}$?
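As an aside, we can use R to enumerate all subsets of a small sample space and verify that there are $2^n$ of them; here is a sketch for the nucleobase sample space defined earlier (the same approach answers the task for $S=\{1,2,3\}$).
In [ ]:
S <- c("A","C","G","T")
# list all non-empty subsets of S: for each size k, the combinations of k elements
subsets <- unlist(lapply(1:length(S), function(k) combn(S, k, simplify = FALSE)), recursive = FALSE)
# add 1 for the empty set: the total is 2^4 = 16
length(subsets) + 1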
We can now define a probability function.
Given a sample space $S$ and an associated sigma algebra $\mathcal{B}$, a probability function is a function $P$ with domain $\mathcal{B}$ that satisfies:
1. $P(A) \geq 0$ for all $A \in \mathcal{B}$
2. $P(S)=1$
3. If $A_1, A_2, ... \in \mathcal{B}$ are pairwise disjoint, then $P(\bigcup_{i=1}^\infty A_i)=\sum_{i=1}^\infty P(A_i)$
These properties are called the Axioms of Probability. Any function $P$ that satisfies these axioms is called a probability function.
TASK Consider the experiment of randomly drawing one nucleotide from a genome and record whether it's a purine ($U$) or pyrimidine ($Y$), so $S=\{U,Y\}$. Let's also assume that $P(\{U\})=P(\{Y\})$. What are possible values for $P(\{U\})$ and $P(\{Y\})$ in order for them to define a legitimate probability function?
There is a much more common method for defining a legitimate probability function.
Let $S=\{s_1, s_2, ..., s_n\}$ be a finite (or countable) set. Let $\mathcal{B}$ be any sigma algebra of subsets of $S$. Let $p_1, p_2, ..., p_n$ be nonnegative numbers that sum to 1. For any $A \in \mathcal{B}$, define $P(A)$ by \begin{equation} P(A) = \sum_{i: s_i \in A} p_i \end{equation} Then $P$ is a probability function on $\mathcal{B}$.
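As a sketch of this construction in R, suppose the base probabilities below are hypothetical values (chosen only so that they sum to 1); the probability of any event is then the sum of the corresponding $p_i$.
In [ ]:
# hypothetical nucleotide probabilities p_i, summing to 1
p <- c(A = 0.3, C = 0.2, G = 0.2, T = 0.3)
# P(A) for the event A = {G, C}
A <- c("G", "C")
sum(p[A])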
Calculus of probabilities¶
From the Axioms of Probability we can derive many properties of the probability function.
If $P$ is a probability function and $A$ is any set in $\mathcal{B}$, then
1. $P(\emptyset)=0$
2. $P(A) \leq 1$
3. $P(A^c) = 1 - P(A)$
We can easily prove the last one. $A$ and $A^c$ form a partition of the sample space, that is, $S=A \cup A^c$. Therefore, $P(A \cup A^c) = P(S) = 1$. Since $A$ and $A^c$ are disjoint, $P(A \cup A^c) = P(A) + P(A^c)$. By combining the last two equations we retrieve $P(A^c) = 1 - P(A)$.
If $P$ is a probability function and $A$ and $B$ are any sets in $\mathcal{B}$, then
1. $P(B \cap A^c) = P(B) - P(A \cap B)$
2. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
3. if $A \subset B$, then $P(A) \leq P(B)$
These properties can be easily visualised with Venn diagrams. In particular, the second one gives a useful inequality for the probability of an intersection. Since $P(A \cup B) \leq 1$, we have \begin{equation} P(A \cap B) \geq P(A) + P(B) - 1 \end{equation} which is a special case of Bonferroni's Inequality. This inequality allows us to bound the probability of a simultaneous event from the probabilities of the individual events. It is useful when it is difficult to calculate the intersection probability but we want to get some idea of its magnitude.
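As a quick numerical illustration (the probabilities below are made-up values for the example only), the Bonferroni bound can be computed directly; note that the bound is only informative when it is positive.
In [ ]:
pA <- 0.95
pB <- 0.90
# lower bound on P(A n B)
max(pA + pB - 1, 0)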
If $P$ is a probability function, then
1. $P(A) = \sum_{i=1}^\infty P(A \cap C_i)$ for any partition $C_1, C_2, ...$ of $S$
2. $P(\bigcup_{i=1}^\infty A_i) \leq \sum_{i=1}^\infty P(A_i) $ for any sets $A_1, A_2, ...$
The last property is called Boole's Inequality, which is a more general version of Bonferroni's Inequality.
Counting¶
Methods of counting are used to construct probability assignments on finite sample spaces. Counting is often difficult and the trick is to break it down into simpler tasks that are easy to count. In fact, the Fundamental Theorem of Counting states that if a job consists of $k$ separate tasks, the $i$-th of which can be done in $n_i$ ways, $i=1,...,k$, then the entire job can be done in $n_1 \times n_2 \times ... \times n_k$ ways.
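For example, a codon consists of three nucleotide positions, each of which can take 4 values, so by the Fundamental Theorem of Counting there are $4 \times 4 \times 4 = 64$ possible codons.
In [ ]:
# number of ways for each of the k = 3 tasks (choosing a nucleotide at each codon position)
n <- c(4, 4, 4)
prod(n)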
Let's assume that a protein domain comprises 6 amino acids. To calculate the probability of observing one particular sequence, we first must count how many different sequences of 6 amino acids can be observed. However, we need to specify whether the same amino acid can be present more than once. This distinction is between counting with replacement and counting without replacement. Furthermore, we may be interested in the exact order of the amino acids or simply in the presence of certain ones. This distinction is between counting ordered objects and counting unordered ones.
We can create a table to summarise the possible methods of counting.
|           | Without replacement | With replacement |
|-----------|---------------------|------------------|
| Ordered   |                     |                  |
| Unordered |                     |                  |
Ordered, without replacement¶
The first amino acid can be selected in 20 ways, the second in 19 ways, etc. So there are $20 \times 19 \times 18 \times 17 \times 16 \times 15 = \frac{20!}{14!}$ ways.
For a positive integer $n$, $n!$ (read $n$ factorial) is the product of all the positive integers less than or equal to $n$, that is, $n! = n \times (n-1) \times (n-2) \times ... \times 2 \times 1$. We define $0!=1$.
We can easily calculate this number in R.
In [ ]:
factorial(20)/factorial(14)
Ordered, with replacement¶
Each of the 6 amino acids can be selected in 20 ways when sampling with replacement. Therefore, there are $20\times20\times20\times20\times20\times20 = 20^6$ ways.
In [ ]:
20^6
Unordered, without replacement¶
Since we know the number of ways when the ordering must be taken into account, $\frac{20!}{14!}$, we need to divide this number by all the possible ways in which 6 amino acids can be ordered. The latter is equal to $6!$. Therefore, the number of unordered selections of 6 amino acids without replacement is $\frac{20!}{6!14!}$.
In [1]:
factorial(20)/(factorial(6)*factorial(14))
38760
For nonnegative integers $n$ and $r$, where $n \geq r$, we define the symbol ${n \choose r}$, read $n \text{ choose } r$, as \begin{equation} {n \choose r} = \frac{n!}{r!(n-r)!} \end{equation}
These numbers are also called binomial coefficients.
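R also provides the choose() function for binomial coefficients, so the previous calculation can be written more directly.
In [ ]:
choose(20, 6)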
Unordered, with replacement¶
This is the most difficult case. Imagine placing 6 indistinguishable markers into the 20 possible bins, one bin per amino acid. We only need to keep track of the arrangement of the 6 markers and the 19 inner walls separating the 20 bins (a row of 20 bins has 21 walls, but the two outermost ones are fixed and therefore not relevant). Counting the arrangements of these $6+19=25$ objects amounts to choosing the positions of the 6 markers, so we have $\frac{25!}{6!19!}$ ways.
In [ ]:
factorial(25)/(factorial(6)*factorial(19))
TASK We can generalise all these derivations and summarise them in the following table.

|           | Without replacement | With replacement |
|-----------|---------------------|------------------|
| Ordered   | ?                   | ?                |
| Unordered | ?                   | ?                |
Enumerating outcomes¶
These counting methods are useful when the sample space $S$ is a finite set and all the outcomes are equally likely. In fact, probabilities of events can then be calculated simply by counting the number of outcomes in the event.
Suppose that $S=\{s_1,...,s_N\}$ is a finite sample space. If all the outcomes are equally likely, then $P(\{s_i\})=1/N$ for every outcome $s_i$. Then, for any event $A$, we have \begin{equation} P(A) = \sum_{s_i \in A} P(\{s_i\}) = \sum_{s_i \in A} \frac{1}{N} = \frac{\text{# elements in } A}{\text{# elements in } S} \end{equation}
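A minimal sketch in R, assuming each of the four bases is equally likely: the probability that a randomly drawn base is a purine is the number of purines divided by the number of elements in $S$.
In [ ]:
S <- c("A","C","G","T")
purines <- c("A","G")
# equally likely outcomes: P(A) = # elements in A / # elements in S
length(intersect(purines, S)) / length(S)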
Conditional probability and independence¶
So far all probabilities have been unconditional, calculated with respect to the original sample space $S$. Often, however, we want to update the sample space in light of new information. In these cases, we need to calculate conditional probabilities.
If $A$ and $B$ are events in $S$, and $P(B)>0$, then the conditional probability of A given B, written $P(A|B)$, is \begin{equation} P(A|B) = \frac{P(A \cap B)}{P(B)} \end{equation}
The intuition here is that our original sample space $S$ has been updated to $B$, so that $P(B|B)=1$. If $A$ and $B$ are disjoint, then $P(A \cap B)=0$ and therefore $P(A|B)=P(B|A)=0$.
Two events, $A$ and $B$, are statistically independent if \begin{equation} P(A \cap B) = P(A)P(B) \end{equation} If the occurrence of $B$ has no effect on the probability of $A$, then $P(A|B)=P(A)$.
If $A$ and $B$ are independent events, then the following pairs are also independent:
• $A$ and $B^c$
• $A^c$ and $B$
• $A^c$ and $B^c$
A collection of events $A_1, A_2, \ldots, A_n$ are mutually independent if for any subcollection $A_{i_1},\ldots,A_{i_k}$, we have \begin{equation} P\left(\bigcap_{j=1}^k A_{i_j}\right) = \prod_{j=1}^k P(A_{i_j}) \end{equation}
Note that this simultaneous (mutual) independence is a stronger requirement than pairwise independence.
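As a sketch, assuming the four bases are equally likely, we can compute a conditional probability by counting and check the definition of independence directly; here $A$ is the event 'the base is a purine' and $B$ the event 'the base is G or C'.
In [ ]:
S <- c("A","C","G","T")
p <- setNames(rep(1/4, 4), S)   # equally likely bases
A <- c("A","G")                 # purines
B <- c("G","C")
pA  <- sum(p[A])
pB  <- sum(p[B])
pAB <- sum(p[intersect(A, B)])
pAB / pB        # P(A|B) = P(A n B) / P(B)
pAB == pA * pB  # TRUE: A and B are independent under this model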
Random variables¶
Often it is easier to work with a summary variable than with the full original structure. For example, we might be interested in the GC content of a nucleotidic sequence $\{A,C,G,T\}$ of 50 base pairs. If we record “1” for a GC base and “0” otherwise (A or T), the sample space for this experiment has $2^{50}$ elements, each one comprising an ordered string of 1s and 0s of length 50.
We can reduce this size by agreeing that the quantity of interest is the number of bases that are GC out of 50. If we define $X$ = number of 1s recorded out of 50, we have captured this requirement. Now the sample space for $X$ is the set of integers $\{0,1,2,…,50\}$ and it is much easier to deal with than the original sample space.
When we define $X$, we create a mapping (a function) from the original sample space to a new sample space.
A random variable is a function from a sample space $S$ into the real numbers.
In defining a random variable, we also defined a new sample space. Can our probability function on the original sample space be used for the random variable?
Suppose we have a sample space $S=\{s_1,…,s_n\}$ with a probability function $P$ and we define a random variable $X$ with range $\mathcal{X}=\{x_1,…,x_m\}$. We can define a probability function $P_X$ on $\mathcal{X}$ as follows. We observe $X=x_i$ if and only if the outcome of the experiment is an $s_j \in S$ such that $X(s_j)=x_i$. More formally, \begin{equation} P_X(X=x_i) = P(\{s_j \in S : X(s_j)=x_i\}) \end{equation} where $P_X$ is an induced probability function on $\mathcal{X}$. $P_X$ satisfies the Kolmogorov Axioms. If $\mathcal{X}$ is uncountable, then \begin{equation} P_X(X \in A) = P(\{s \in S : X(s) \in A\}) \end{equation}
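A minimal simulation sketch of the GC-count random variable $X$ introduced above, assuming each of the 50 bases is drawn independently and uniformly from $\{A,C,G,T\}$ (an assumption made here purely for illustration).
In [ ]:
set.seed(1)
bases <- c("A","C","G","T")
# one outcome s in the original sample space: a sequence of 50 bases
s <- sample(bases, size = 50, replace = TRUE)
# the random variable X maps s to the number of G or C bases
X <- function(s) sum(s %in% c("G","C"))
X(s)
# estimate the induced probability P_X(X = 25) by repeating the experiment
mean(replicate(10000, X(sample(bases, 50, replace = TRUE))) == 25)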
Distribution functions¶
With every random variable $X$, we associate a function called the cumulative distribution function of $X$.
The cumulative distribution function or cdf of a random variable $X$, denoted by $F_X(x)$ is defined by \begin{equation} F_X(x) = P_X(X \leq x) \text{, for all } x \end{equation}
TASK Consider the experiment of sampling three nucleotidic bases and let $X$ be the number of GC bases observed. What is the cdf of $X$? Can you plot it?
\begin{align*} F_X(x) &= 0 \text{, if } -\infty < x < 0 \\ F_X(x) &= ? \text{, if } \\ ... \end{align*}
Note that the step function $F_X(x)$ is defined for all values of $x$, not just those in $\mathcal{X}$. Also note that $F_X(x)$ has jumps at the values $x_i \in \mathcal{X}$ and the size of the jump at $x_i$ is equal to $P(X=x_i)$. $F_X(x)$ can be discontinuous but has the property of right-continuity.
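As a check once the cdf has been worked out: if we assume that each of the three bases is independently GC with probability $1/2$ (this assumption is ours, not part of the task statement), then $X$ follows a Binomial(3, 0.5) distribution and its step cdf can be plotted as follows.
In [ ]:
x <- 0:3
plot(x, pbinom(x, size = 3, prob = 0.5), type = "s",
     xlab = "x", ylab = expression(F[X](x)), main = "cdf of X")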
The function $F_X(x)$ is a cdf if and only if the following three conditions hold:
1. $\lim_{x \rightarrow -\infty} F_X(x)=0$ and $\lim_{x \rightarrow \infty} F_X(x)=1$.
2. $F_X(x)$ is a nondecreasing function of $x$.
3. $F_X(x)$ is right-continuous.
Whether a cdf is continuous or not corresponds to the associated random variable being continuous or not.
A random variable $X$ is continuous if $F_X(x)$ is a continuous function of $x$. A random variable $X$ is discrete if $F_X(x)$ is a step function of $x$.
It is important to understand that $F_X(x)$ completely determines the probability distribution of a random variable $X$.
The random variables $X$ and $Y$ are identically distributed if, for every set $A \in \mathcal{B}^1$ (the smallest sigma algebra containing all intervals of real numbers), $P(X \in A) = P(Y \in A)$.
If $X$ and $Y$ are identically distributed, then $F_X(x)=F_Y(x)$ for every $x$.
TASK Consider the experiment of sampling nucleotidic bases and let $X$ be the number of GC bases observed and $Y$ the number of AT bases observed. Are $X$ and $Y$ identically distributed?
Density and mass functions¶
Apart from its cdf $F_X$, another function is associated with a random variable $X$. This function is called either the probability density function (pdf) or probability mass function (pmf).
The probability mass function (pmf) of a discrete random variable $X$ is given by \begin{equation} f_X(x) = P(X=x) \text{ for all } x \end{equation}
Recall that $f_X(x)$ is the size of the jump in the cdf at $x$. From the pmf, we can calculate probabilities of events, as \begin{equation} P(a \leq X \leq b) = \sum_{k=a}^b f_X(k) \end{equation} for positive integers $a$ and $b$, with $a \leq b$.
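For instance, if $X$ is the GC count out of 3 bases as in the earlier task (again assuming each base is GC with probability $1/2$, so that $X$ is Binomial(3, 0.5)), then $P(1 \leq X \leq 2)$ is obtained by summing the pmf.
In [ ]:
sum(dbinom(1:2, size = 3, prob = 0.5))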
We need to be more careful in the continuous case. In the discrete case we can sum over values of the pmf to obtain the cdf; in the continuous case we substitute integrals for sums: \begin{equation} P(X \leq x) = F_X(x) = \int_{-\infty}^x f_X(t) dt \end{equation} If $f_X(x)$ is continuous, then \begin{equation} \frac{d}{dx} F_X(x) = f_X(x) \end{equation}
The probability density function or pdf, $f_X(x)$, of a continuous random variable $X$ is the function that satisfies \begin{equation} F_X(x) = \int_{-\infty}^{x} f_X(t) dt \text{ for all } x \end{equation}
The expression "$X$ has a distribution given by $F_X(x)$" can be written as "$X \sim F_X(x)$". We can also write $X \sim f_X(x)$ or $X \sim Y$ if $X$ and $Y$ have the same distribution. The pdf (or pmf) contains the same information as the cdf. We can use either one to solve problems.
TASK Assuming that the cdf is \begin{equation} F_X(x) = \frac{1}{1+e^{-x}} \end{equation} what is the pdf?
A function $f_X(x)$ is a pdf (or pmf) of a random variable $X$ if and only if
1. $f_X(x) \geq 0$ for all $x$
2. $\sum_x f_X(x)=1$ (pmf) or $\int_{-\infty}^{\infty} f_X(x) dx =1 $ (pdf)
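These two conditions are easy to check numerically in R; for example, integrating the standard Normal pdf over the real line gives 1 (property 2 for a pdf).
In [ ]:
integrate(dnorm, lower = -Inf, upper = Inf)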
Probability distributions¶
There are a number of fundamental probability distributions that should be studied and understood in detail. Discrete probability distributions include: the discrete uniform, Bernoulli, binomial, geometric and Poisson distributions. R provides functions to evaluate and sample from all of these distributions.
In [ ]:
# e.g. Binomial distribution
?dbinom
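Each distribution in R comes with four related functions, prefixed by d (density/mass), p (cdf), q (quantile) and r (random generation); for the binomial, for example:
In [ ]:
dbinom(2, size = 10, prob = 0.5)    # P(X = 2)
pbinom(2, size = 10, prob = 0.5)    # P(X <= 2)
qbinom(0.5, size = 10, prob = 0.5)  # median
rbinom(5, size = 10, prob = 0.5)    # 5 random draws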
Univariate continuous probability distributions are also widely used in modelling. These include the continuous uniform, exponential, Normal, Chi-squared ($\chi^2$), log-normal and Gamma distributions.
In [ ]:
# e.g. Normal distribution
?dnorm
For all these distributions, we can calculate notable quantities, including the expectation and the variance. The expected value of a continuous random variable $X$ is \begin{equation} E[X] = \int_{-\infty}^{\infty} x f_X(x) dx \end{equation} The variance of a continuous random variable is \begin{equation} Var[X] = E[(X-E[X])^2] = E[X^2] - (E[X])^2 \end{equation}
There are several rules for manipulating expected values and variances.
• If $X$ is a continuous random variable with pdf $f_X(x)$, then for any real-valued function $g$ \begin{equation} E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx \end{equation}
• If $a$ and $b$ are constants, $E[aX+b]=aE[X]+b$ and $Var[aX+b]=a^2Var[X]$
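As a closing sketch, we can check these rules by simulation, assuming $X \sim \text{Normal}(2, 1)$ and comparing the sample mean and variance of $aX+b$ with the values predicted above (the numbers are arbitrary illustration choices).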
In [ ]:
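set.seed(42)
a <- 3; b <- 1
x <- rnorm(1e5, mean = 2, sd = 1)  # X ~ Normal(2, 1): E[X] = 2, Var[X] = 1
mean(a * x + b)                    # should be close to a*E[X] + b = 7
var(a * x + b)                     # should be close to a^2*Var[X] = 9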