21st October 2019
Inference
https://bitbucket.org/mfumagal/
statistical_inference
Matteo Fumagalli
Intended Learning Outcomes
By the end of this session, you will be able to:
Explain the difference between population and sample statistics
Describe data using a range of descriptive and graphical summaries
Illustrate the properties of estimators and principles of hypothesis testing
From probability theory to statistics
Populations and random samples
A population is the set of all units or objects one intends to study:
UK human population (e.g. height)
Human T-lymphotropic Virus-1 (HTLV-1) infected T-cells
Gut microbiota
…
Populations and random samples
Although we would like to study the whole population, observing all the units in the population is often not possible:
Measurement costs are too high (e.g. census)
No access to all units (e.g. online survey respondents, T-cells in a blood sample, bacteria in faecal sample)
A random sample is a subset of the population that is representative of the population.
Stages of a statistical analysis
There are four broad stages:
1 Select measurement variables
2 Perform random sampling
3 Construct one or more statistical models
4 Perform data analysis:
Descriptive statistics
Inferential statistics
1. Measurement variables
The initial stage of any experimental analysis involves the selection of variables to observe and their measurement scale.
There are two types of data:
Quantitative
Continuous (e.g. concentration) vs. Discrete (e.g. cell counts)
Univariate (e.g. light intensity) vs. Multivariate (e.g. chemotaxis: 2D movement vector, microarray data)
Qualitative
Nominal: no logical ordering (categories, classes, binary data, e.g. healthy/diseased, male/female, smoker/non-smoker, etc.)
Ordinal: codes with logical ordering (e.g. exam grades)
2. Random sampling
Most statistical analyses assume the following:
Extract a random sample of n units
All the measurements are collected in sample data D = {x1,…,xn}
The elements of D are random realisations of n random variables {X1, …, Xn}
Assume variables are independent and identically distributed (i.i.d.)
The underlying probability function or density function is fX(x) ≡ fX(x;θ) where θ is a parameter or parameter vector.
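As a minimal sketch (not part of the original notebook), the Python code below draws an i.i.d. random sample from an assumed Normal model fX(x; θ); the parameter values and sample size are arbitrary.

```python
# Sketch: drawing an i.i.d. random sample D = {x1, ..., xn} from an assumed
# model fX(x; theta), here a Normal distribution with hypothetical parameters.
import numpy as np

rng = np.random.default_rng(seed=42)   # seed chosen arbitrarily for reproducibility
mu, sigma, n = 170.0, 10.0, 50         # hypothetical population parameters and sample size

D = rng.normal(loc=mu, scale=sigma, size=n)   # n i.i.d. realisations of X ~ N(mu, sigma^2)
print(D[:5])
```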
3. Statistical models
A statistical model is a set {fX (x ; θ)|θ ∈ Θ} of probability measures, one of which corresponds to the true, unknown, probability measure p(x;θ∗) that produced the data.
θ is the parameter of the model, and Θ is the parameter space.
θ∗ is the true or population parameter such that p(x;θ∗) is the true probability measure for the data.
We decide on the appropriate model either by using prior knowledge of the data generating process or by using exploratory statistical tools.
Note that the data space X and the parameter space Θ are different spaces!
4. Data analysis
Descriptive statistics is the discipline of summarising and describing data (e.g. quantitative summaries and visual representations of the data)
Inferential statistics is the branch of statistics which attempts to draw conclusions about the population from random samples
Descriptive statistics: summaries of the data
Given a random sample, one produces numerical and visual summaries of the data in order to:
Detect trends or features in the observed data
Detect outliers
Suggest appropriate statistical models, often in the absence of prior knowledge (e.g. bi-modality in histograms suggests mixture models)
Typical quantitative summaries of data fall into several classes: location or central tendency measures, mode, scale (or spread of the data), skewness (or asymmetry).
Central tendencies
arithmetic mean (or sample mean)
weighted arithmetic mean
geometric mean
harmonic mean
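A brief illustration of these four measures, assuming numpy and scipy are available and using made-up data:

```python
# Sketch: the four central-tendency measures listed above, on a small
# illustrative data set (values and weights are made up).
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 8.0, 16.0])
w = np.array([1.0, 1.0, 2.0, 1.0, 1.0])       # hypothetical weights

arithmetic_mean = x.mean()
weighted_mean   = np.average(x, weights=w)
geometric_mean  = stats.gmean(x)              # requires strictly positive data
harmonic_mean   = stats.hmean(x)              # requires strictly positive data

print(arithmetic_mean, weighted_mean, geometric_mean, harmonic_mean)
```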
Scale
For univariate sample data, the common summary measures for scale are
sample minimum
sample maximum
sample range
pth-quantile of the empirical distribution function: median, lower and upper quartiles, inter-quartile range.
(unbiased) sample variance:
$$s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
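A short sketch of these scale summaries in Python (illustrative data, assuming numpy):

```python
# Sketch: common scale summaries for a univariate sample (illustrative data).
import numpy as np

x = np.array([3.1, 2.7, 3.9, 4.4, 2.5, 3.3, 5.0, 3.8])

x_min, x_max = x.min(), x.max()
sample_range = x_max - x_min
q1, median, q3 = np.quantile(x, [0.25, 0.5, 0.75])
iqr = q3 - q1
s2 = x.var(ddof=1)     # unbiased sample variance: divides by n - 1

print(x_min, x_max, sample_range, median, iqr, s2)
```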
Skewness
Third and higher moments of a distribution can provide useful descriptions.
Skewness measures the departure from symmetry of the probability distribution of a real-valued random variable.
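For example, the sample skewness can be computed with scipy.stats.skew (illustrative data):

```python
# Sketch: sample skewness as a standardised third moment.
import numpy as np
from scipy import stats

x = np.array([3.1, 2.7, 3.9, 4.4, 2.5, 3.3, 5.0, 9.8])   # one large value gives right skew
print(stats.skew(x))       # positive value indicates right (positive) skew
```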
Graphical summaries
In addition to the preceding quantitative summaries, it is often useful to provide visual or graphical summaries.
The type of graphs/plots depend on the data obtained, e.g. Discrete or continuous, Uni- or multivariate.
Besides providing a summary of the data, graphical summaries can also illustrate any testing we may wish to perform.
Discrete data visualisation – Barplots and pie charts
Continuous data visualisation – Histograms and density estimates
Continuous data visualisation – Boxplots
Bivariate data visualisation – Scatterplots
Probability plotting techniques – QQ plot
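As an illustrative sketch (assuming matplotlib and scipy), three of these graphical summaries for simulated continuous data:

```python
# Sketch: histogram, boxplot and normal QQ plot for simulated data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)                       # histogram
axes[0].set_title("Histogram")
axes[1].boxplot(x)                             # boxplot
axes[1].set_title("Boxplot")
stats.probplot(x, dist="norm", plot=axes[2])   # QQ plot against the Normal
plt.tight_layout()
plt.show()
```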
Inferential statistics
It provides the mathematical theory for inferring properties of an unknown distribution from data generated from that distribution.
In the parametric approach, one selects a suitable distribution and attempts to infer its parameters from the data.
In a non-parametric approach, one does not make any assumption about the underlying distribution of the data.
There are several approaches to statistical inference: frequentist, likelihoodist, Bayesian.
Statistical inference
Assume that we know and accept a statistical model fX(x;θ).
The main objective of parametric statistical inference is to learn something about the unknown population parameter θ from information contained in sample data.
We will use an estimator or statistic, a function of the sample data, to learn about θ.
Statistical inference
Common tasks are:
1 Point estimation: obtain a single estimate θˆ of the population parameter θ
2 Interval estimation: obtain an interval that contains the unknown population parameter θ with a given probability
3 Hypothesis testing: test a specific hypothesis about θ, i.e. do the observed data support the hypothesis?
Estimators
An estimator or statistic is a function of the random sample D = {x1, …, xn}, say t(D).
t(D) depends on the data sample alone
We have already encountered several statistics, e.g. the sample mean and the sample variance
Because an estimator is a function of random variables, it is itself a random variable with its own distribution. We usually refer to the latter as the sampling distribution of the sample statistic.
Estimators
Estimators are used to estimate the unknown population parameters.
For instance, if we assume some parametric model with population mean parameter μ, we may want to use x̄ as an estimate for μ.
Often, it is not trivial to construct estimators. One approach is the maximum likelihood estimator (more on this later).
Constructing and characterising estimators
We encountered several measures of central tendency – the (arithmetic) sample mean, median, geometric mean, etc.
Which estimator is the "best" one for estimating the population mean parameter θ? How is one estimator "better" than another?
Since estimators are random variables, we compare them by assessing their respective sampling distribution (when known)
We will look at the following properties: 1. Bias, 2. Mean Squared Error (MSE), 3. Efficiency, 4. Consistency
Bias
Suppose we could repeat the same experiment a number of times, say B, and collect new data D each time.
Each time we use our estimator t(D) of choice to obtain an estimate θˆ of an unknown population parameter θ.
We have a set of different estimates which are samples from the sampling distribution of the estimator/sample-statistic t(D).
The bias of an estimator t(D) is defined as bias(t(D)) = Eθ(t(D)) − θ
An unbiased estimator is one with zero bias, i.e. Eθ(t(D)) = θ.
jupyter-notebook: inference
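The following simulation sketch (hypothetical parameter values) illustrates bias by repeating the experiment B times and comparing the divide-by-n and divide-by-(n − 1) variance estimators:

```python
# Sketch: approximating the bias of two variance estimators by repeating the
# experiment B times, as described above (all parameter values are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, B = 0.0, 2.0, 10, 20000
true_var = sigma**2

est_biased, est_unbiased = [], []
for _ in range(B):
    D = rng.normal(mu, sigma, size=n)         # new data each replicate
    est_biased.append(D.var(ddof=0))          # divides by n
    est_unbiased.append(D.var(ddof=1))        # divides by n - 1

print("bias (divide by n):    ", np.mean(est_biased) - true_var)    # approx -sigma^2 / n
print("bias (divide by n - 1):", np.mean(est_unbiased) - true_var)  # approx 0
```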
Mean squared error (MSE) and standard error of an estimate
How much variability do we expect to see in θˆ under repeated sampling from the assumed distribution?
A common measure of the spread of the sampling distribution t(D) around the true parameter θ is given by the mean squared error (MSE)
MSE(t(D)) = E[(t(D) − θ)²]
It measures the average squared difference between the estimator and θ.
Mean squared error (MSE) and standard error of an estimate
The MSE has two components:
the variability of the estimator (precision)
its bias (accuracy)
For an unbiased estimator, the MSE equals its variance, and the standard error is the square root of the MSE.
Efficiency
All things equal, we choose the unbiased estimator with the smallest variance, i.e. with higher precision.
The efficiency of t1(D) relative to t2(D) is
efficiency = Var(t2(D)) / Var(t1(D))
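A simulation sketch (illustrative settings) estimating the efficiency of the sample mean relative to the sample median for Normal data:

```python
# Sketch: relative efficiency of the sample mean (t1) vs the sample median (t2)
# as estimators of a Normal mean, via repeated sampling.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, B = 0.0, 1.0, 50, 20000

means   = np.array([rng.normal(mu, sigma, n).mean()     for _ in range(B)])
medians = np.array([np.median(rng.normal(mu, sigma, n)) for _ in range(B)])

# efficiency of the mean relative to the median = Var(median) / Var(mean)
print(medians.var() / means.var())   # roughly pi/2 ≈ 1.57 for Normal data
```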
Consistency
Consistency is an asymptotic property of an estimator.
It describes the behaviour of the estimator t(D) as the sample size n gets larger and larger. Hence it involves a sequence of estimators.
Two sufficient conditions for consistency are:
limn→∞ E(tn) = θ
limn→∞ Var(tn) = 0
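A quick sketch (arbitrary true mean) showing the sample mean concentrating around the true parameter as n grows:

```python
# Sketch: consistency of the sample mean as an estimator of the true mean.
import numpy as np

rng = np.random.default_rng(0)
mu = 5.0
for n in [10, 100, 1000, 10000, 100000]:
    D = rng.normal(mu, 2.0, size=n)
    print(n, D.mean())     # approaches mu as n increases
```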
Wrap up
1 Stages of a statistical analysis:
1 Define measurement variables
2 Perform random sampling
3 Choose a statistical model
4 Perform data analysis
2 Descriptive statistics: data summaries, visualisation
3 Statistical inference: estimators and their properties, …
Model fitting
How do we fit models to data? A model will have one or more parameters, θ, that we need to estimate to get a good fit.
e.g. we may want to model some observations as being normally distributed with mean μ and standard deviation 1.
What is μ?
Model fitting
What is μ? Candidate values μ = 8, μ = 12 and μ = 10 are each compared visually against the observed data.
Likelihood
The concept of likelihood provides us with a formal framework for estimating parameters.
In particular, maximum likelihood estimation is a general method for estimating the parameters of a (probability) model.
The likelihood function is also one of the key ingredients for Bayesian inference.
Maximum likelihood estimate
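A minimal sketch of maximum likelihood estimation for the model above (X ~ N(μ, 1)); the data are simulated with a hypothetical true μ, and a grid search over candidate values of μ (including 8, 10 and 12) picks the one maximising the log-likelihood:

```python
# Sketch: MLE of mu for X ~ N(mu, 1), echoing the "what is mu?" example above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=30)   # observed data (simulated)

mu_grid = np.linspace(5, 15, 1001)
log_lik = np.array([stats.norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mu_grid])

mu_hat = mu_grid[np.argmax(log_lik)]
print("MLE of mu:", mu_hat, "  sample mean:", x.mean())   # the two should agree closely
```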
Wrap up
jupyter-notebook: inference
Statistical tests
In our study of statistical inference, we have focused on point estimation of unknown parameters of a given statistical model.
The residual uncertainty is expressed as uncertainty in these point estimates (e.g. the sampling distribution of the estimator).
We will now look at hypothesis testing:
Make a definitive hypothesis about θ
Uncertainty is reflected in the expected probability of being wrong (p-value)
Statistical test
We will encounter a wide range of statistical tests:
Parametric vs. Non-parametric
One-sided vs. Two-sided tests
Z-test, t-test, goodness-of-fit tests, likelihood ratio test, etc…
However, there is a common approach to all statistical tests:
1 Generate a Null Hypothesis and an Alternative Hypothesis
2 Obtain the sampling distribution of the estimator under the null hypothesis: the Null distribution
3 Decide whether to reject the null hypothesis
Null and Alternative Hypotheses
Suppose we want to know if the use of a drug is associated with a symptom.
We take some mice and randomly divide them into two groups.
We expose one group to the drug and leave the second group unexposed.
We then compare the symptom rate in the two groups, θ1, θ2 respectively.
Null and Alternative Hypotheses
Consider the following two hypotheses:
The Null Hypothesis: the symptom rate is the same in the two groups, H0 : θ1 = θ2
The Alternative Hypothesis: the symptom rate is not the same in the two groups, H1 : θ1 ≠ θ2
If the exposed group has a much higher symptom rate than the unexposed group, we will reject the null hypothesis and conclude that the data favour the alternative hypothesis.
Null and Alternative Hypotheses
Let Θ be the parameter space of a statistical model.
We partition Θ into two disjoint sets Θ0 and Θ1 and test
H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1
H0 and H1 are the Null and Alternative hypotheses, respectively.
Null and Alternative Hypotheses
A hypothesis of the form θ = θ0 is called a simple hypothesis.
A hypothesis of the form θ > θ0 (or θ < θ0) is called a composite hypothesis.
A test of the form H0 : θ = θ0 vs. H1 : θ ≠ θ0 is called a two-sided test.
A test of the form H0 : θ = θ0 vs. H1 : θ < θ0 (or >) is called a one-sided test.
The Null distribution
Let D be a random sample of data.
We test a hypothesis by finding a set of outcomes R called the rejection region.
If D ∈ R, we reject the null hypothesis in favour of the alternative hypothesis.
If D ∉ R, we fail to reject H0.
Can we ever accept H0? No!
The Null distribution
R can be in the form R = {D : t(D) > c} where D is the data, t(D) is the test statistic, and c is the critical value(s).
The critical value is determined from two things:
The null distribution, i.e. the sampling distribution of t(D), assuming the null hypothesis is true.
The significance level α (typically α << 1).
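For example, assuming the null distribution of the test statistic is N(0, 1) (a z-test), the critical value at α = 0.05 can be obtained as:

```python
# Sketch: critical value c for a two-sided z-test at significance level alpha,
# where the null distribution of the test statistic is N(0, 1).
from scipy import stats

alpha = 0.05
c = stats.norm.ppf(1 - alpha / 2)       # approximately 1.96
print("reject H0 if |t(D)| >", c)
```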
Significance and confidence intervals
When there is enough evidence to reject H0, we say that the result from a statistical test is "statistically significant at the given significance level α".
The significance level α controls how stringent the test is.
When α is very small, the acceptance region is larger and the test is more stringent because the null hypothesis will be rejected less frequently.
The probability of making an error and wrongly rejecting the null hypothesis when it is in fact true is exactly α.
Significance and confidence intervals
Statistical significance does not imply scientific significance!
It is often more informative to give confidence intervals.
A (1 − α)-confidence interval (C.I.) is an interval in which the true parameter lies with probability 1 − α.
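A minimal sketch of a (1 − α)-confidence interval for a Normal mean with known σ = 1 (simulated data):

```python
# Sketch: 95% confidence interval for mu, built from the sampling distribution
# of the sample mean when sigma = 1 is known.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=30)   # simulated data

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)
se = 1.0 / np.sqrt(len(x))                     # standard error of the mean
ci = (x.mean() - z * se, x.mean() + z * se)
print("95% C.I. for mu:", ci)
```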
p-values
We have established a decision rule: we reject H0 when the value of the test statistic falls outside the acceptance region.
Other than a reject/accept decision, the result of a statistical test is often reported in terms of a p-value.
A p-value is:
the probability, under the null hypothesis, of a result as or more extreme than that actually observed.
the smallest significance level at which the null hypothesis would be rejected.
p-values
A p-value is NOT the probability that the null hypothesis is true, i.e. do not confuse the p-value with P(H0|D).
A large p-value can occur for two reasons:
(i) H0 is true or
(ii) H0 is false but the test has low power (e.g. too few samples).
Type I and II errors
When we test the null hypothesis versus the alternative hypothesis about a population parameter, there are four possible outcomes, two of which are erroneous:
A Type I error is made when the null hypothesis is wrongly rejected.
A Type II Error is made when we conclude that we do not have enough evidence to reject the H0 , but in fact the alternative hypothesis H1 is true.
The power of a test
The power of a test represents its ability to correctly reject the null hypothesis. It is expressed as the probability
P(Reject H0 | H1 true) = 1 − P(Do not reject H0 | H1 true) = 1 − P(Type II error)
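As an illustrative sketch, the power of a two-sided one-sample z-test (σ = 1 known) for H0: μ = 0 against a hypothetical true value μ = 0.5, as a function of the sample size:

```python
# Sketch: power of a two-sided one-sample z-test for H0: mu = 0 when mu = 0.5.
import numpy as np
from scipy import stats

alpha, mu_alt, sigma = 0.05, 0.5, 1.0
z = stats.norm.ppf(1 - alpha / 2)

for n in [10, 30, 50, 100]:
    se = sigma / np.sqrt(n)
    # P(reject H0 | H1 true): probability the test statistic lands in the rejection region
    power = stats.norm.sf(z - mu_alt / se) + stats.norm.cdf(-z - mu_alt / se)
    print(n, round(power, 3))
```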
Statistical tests
There are several widely used tests:
The Wald test: tests the true value of a parameter based on its sample estimate.
t-test: to determine if two sets of data are significantly different from each other, assuming normality.
Wilcoxon signed-rank test: non-parametric, to compare two related samples and assess whether their population mean ranks differ.
Chi-squared goodness-of-fit test: to test whether observed sample frequencies differ from expected frequencies.
The likelihood ratio test.
Chi-squared goodness-of-fit test
Example: we observe two alleles, A and G, for a specific genomic locus in a population. Specifically, we observe 14 genotypes AA, 4 genotypes AG and 2 genotypes GG.
Can we reject the hypothesis of Hardy-Weinberg Equilibrium (HWE, i.e. random mating and no natural selection) for this locus? Note that, under HWE with allele frequency f, we expect the following genotype frequencies:
AA homozygous: f²
AG heterozygous: 2f(1 − f)
GG homozygous: (1 − f)²
jupyter-notebook: inference
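A sketch of this test using scipy (the allele frequency f is estimated from the data, costing one extra degree of freedom):

```python
# Sketch: chi-squared goodness-of-fit test for HWE with the counts above
# (14 AA, 4 AG, 2 GG). ddof=1 accounts for estimating f from the data.
import numpy as np
from scipy import stats

observed = np.array([14, 4, 2])                 # AA, AG, GG
n = observed.sum()
f = (2 * observed[0] + observed[1]) / (2 * n)   # estimated frequency of allele A

expected = n * np.array([f**2, 2 * f * (1 - f), (1 - f)**2])
chi2, p = stats.chisquare(observed, f_exp=expected, ddof=1)
print("chi2 =", chi2, "p-value =", p)
```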
Likelihood Ratio Test
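As an illustrative sketch, a likelihood ratio test compares the maximised log-likelihoods under H0 and H1; under H0 the statistic 2(ℓ1 − ℓ0) is approximately chi-squared distributed. The assumed model here (X ~ N(μ, 1), testing H0: μ = 0) is hypothetical:

```python
# Sketch: likelihood ratio test for H0: mu = 0 vs H1: mu unrestricted, X ~ N(mu, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=40)     # simulated data

loglik_H0 = stats.norm.logpdf(x, loc=0.0, scale=1.0).sum()        # mu fixed at 0
loglik_H1 = stats.norm.logpdf(x, loc=x.mean(), scale=1.0).sum()   # mu at its MLE

lrt = 2 * (loglik_H1 - loglik_H0)
p_value = stats.chi2.sf(lrt, df=1)               # one restricted parameter
print("LRT statistic =", lrt, "p-value =", p_value)
```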
Bootstrapping
It relies on random sampling with replacement to assign measure of accuracy to sample estimates.
The idea is to infer population parameters by resampling the data (with replacement) and performing inference on each resampled data set.
It assumes that samples are independent (otherwise use a block bootstrap for correlated samples).
jupyter-notebook: inference
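A minimal bootstrap sketch (illustrative data and number of replicates B) for the standard error and a percentile confidence interval of the sample mean:

```python
# Sketch: non-parametric bootstrap for the standard error and a 95% percentile
# confidence interval of the sample mean.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=50)    # original sample
B = 10000

boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean()   # resample with replacement
    for _ in range(B)
])

print("bootstrap SE of the mean:", boot_means.std(ddof=1))
print("95% percentile C.I.:", np.quantile(boot_means, [0.025, 0.975]))
```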
Wrap up
General procedure to test statistical hypotheses about a population parameter θ:
1 Set up the null and alternative hypotheses for θ, H0 and H1 (one-sided or two-sided).
2 Compute the test statistic and obtain the null distribution, i.e. the sampling distribution under the null hypothesis.
3 Choose a significance level α or confidence level 1 − α.
4 Determine the rejection region for the test statistic, which depends on α.
5 Apply the decision rule: reject H0 in favour of H1 if the test statistic falls in the rejection region. Otherwise conclude that there is insufficient evidence to reject H0.
6 Report the p-value and construct the confidence interval.