Conjugate prior distributions
Bayesian Statistics, Statistics 4224/5224, Spring 2021
January 19, 2021
The following notes summarize Sections 1.3 and 2.1–2.6 of Bayesian Data Analysis, by Andrew Gelman et al. Material is also taken from Chapter 3 of A First Course in Bayesian Statistical Methods, by Peter D. Hoff, and Sections 1.3–1.5 & 2.1 of Bayesian Statistical Methods, by Reich and Ghosh.
Agenda
1. Bayesian inference
2. Summarizing the posterior
3. The posterior predictive distribution
4. Conjugate priors
(a) Estimating a proportion from binomial data
(b) Estimating a rate from count data (Poisson model)
(c) Estimating a normal mean
(d) Estimating a normal variance
Bayesian inference
The pdf (or pmf) of the data given the parameters, p(y|θ), is called the likelihood function.
Statistical inference is concerned with the inverse problem of using the likelihood function to estimate θ.
Bayesian inference quantifies uncertainty about the unknown parameters by treating them as random variables.
Treating θ as a random variable requires specifying the prior distribution, p(θ), which represents our uncertainty about the parameters before we observe the data.
If we view θ as a random variable, we can apply Bayes' rule to obtain the posterior distribution
p(θ|y) = p(θ, y)/p(y) = p(θ)p(y|θ)/p(y)
where
p(y) = ∫ p(θ, y) dθ = ∫ p(θ)p(y|θ) dθ .
Noting that p(y) does not depend on θ and, with fixed y, can thus be considered a constant, yields the equivalent form
p(θ|y) ∝ p(θ)p(y|θ) .
The Bayesian framework provides a logically consistent way to use all available information to quantify uncertainty about model parameters.
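As a toy illustration of this proportionality (not taken from the texts), the following R sketch approximates the posterior on a grid for a hypothetical coin-flipping example; all numbers are made up, and the normalizing constant p(y) is handled by rescaling the grid values.

```r
# Minimal sketch (hypothetical example): grid approximation of
# p(theta | y) proportional to p(theta) p(y | theta), for a made-up
# data set of y = 7 successes in n = 10 binomial trials.
theta <- seq(0.001, 0.999, length.out = 1000)   # grid of parameter values
prior <- dbeta(theta, 1, 1)                     # a flat (uniform) prior
like  <- dbinom(7, size = 10, prob = theta)     # likelihood p(y | theta)
post  <- prior * like
post  <- post / sum(post * diff(theta)[1])      # normalize so it integrates to 1
plot(theta, post, type = "l",
     xlab = expression(theta), ylab = "posterior density")
```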
How do we pick the prior?
In many cases prior knowledge from experience, expert opinion or similar studies is available and can be used to specify an informative prior.
In other cases, where prior information is unavailable, the prior should be uninformative to reflect this lack of knowledge.
The choice of prior distribution is subjective, i.e., driven by the analyst’s past experience and personal preferences.
How do we pick the likelihood?
The likelihood function is the same as in a classical analysis. Consider multiple linear regression:
• Which covariates to include?
• Distributional assumption about errors?
• Quadratic terms? Interaction effects?
• Transformation of response?
• Treatment of outliers? Partially missing data?
The choice of likelihood is also subjective!
The Bayesian inferential framework provides a logical foundation to accommodate both the objective and subjective parts of a data analysis.
Summarizing the posterior
The final output of a Bayesian analysis is the posterior distribution of the model parameters.
The posterior contains all the relevant information from the data and the prior, and thus all statistical inference should be based on the posterior distribution.
A univariate posterior is best summarized with a plot, because this retains all the information about the parameter.
Point estimation
A point estimate is a single value that represents the best estimate of the parameter given the data and the prior.
The posterior mean, median and mode are all sensible choices.
Each summary has its own interpretation (for example, the mode may be interpreted as the single ‘most likely’ value).
Point estimators should be accompanied by a posterior variance or standard deviation to convey uncertainty.
If the posterior is approximately normal, then the posterior probability that the parameter is within two posterior SDs of the posterior mean is approximately 0.95.
Posterior quantiles and intervals
If the posterior p(θ|y) is approximately normal, the interval E(θ|y) ± 2 SD(θ|y)
is an approximate 95% posterior interval.
More generally, a 100(1 − α)% posterior interval is any interval
[l(y),u(y)] that satisfies
Pr[l(y) < θ < u(y)|y] = 1−α .
For the 100(1 − α)% central posterior interval, l(y) and u(y) are set to the α/2 and 1 − α/2 posterior quantiles; that is, they are chosen so that
Pr[θ < l(y)|y] = Pr[θ > u(y)|y] = α/2 .
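In practice these summaries are often computed from Monte Carlo draws of θ from the posterior. A minimal R sketch, assuming for illustration that the draws come from a Beta(8, 4) posterior:

```r
# Minimal sketch (assumed example): posterior summaries from Monte Carlo
# draws.  Here theta_draws plays the role of draws from p(theta | y);
# for concreteness they are taken from a hypothetical Beta(8, 4) posterior.
theta_draws <- rbeta(10000, 8, 4)
mean(theta_draws)                          # posterior mean
sd(theta_draws)                            # posterior SD
quantile(theta_draws, c(0.025, 0.975))     # 95% central posterior interval
```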
The posterior predictive distribution
Let ỹ be the future observation we would like to predict.
Assuming the observations are independent given the parameters, then given θ we have p(ỹ|θ, y) = p(ỹ|θ), and prediction is straightforward.
Unfortunately, we do not know θ exactly, even after observing y. A remedy for this is to 'plug in' a value of θ, say the posterior mean θ̂ = E(θ|y), and assume
p(ỹ|y) ≈ p(ỹ|θ̂) .
If the posterior variance of θ is small then its uncertainty is negligible; otherwise a better approach is needed.
For the sake of prediction, the parameters are not of interest themselves, but rather they serve as vehicles to transfer information from the data to the predictive model.
The distribution of a new outcome given the observed data, called the posterior predictive distribution, accounts for parametric uncertainty because it can be written
p(ỹ|y) = ∫ p(ỹ, θ|y) dθ = ∫ p(ỹ|θ)p(θ|y) dθ .
To further illustrate how the PPD accounts for parametric uncertainty, we consider how to draw a sample from the PPD:
If θ^(s) ∼ p(θ|y), and ỹ^(s) ∼ p(ỹ|θ^(s)), then ỹ^(s) ∼ p(ỹ|y).
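This two-step recipe is easy to code. A minimal R sketch, assuming (purely for illustration) the beta posterior for a binomial proportion developed on the following pages, with made-up values a = 1, b = 1, n = 10, y = 7:

```r
# Minimal sketch of composition sampling from the PPD, assuming a
# Beta(a + y, b + n - y) posterior; all values below are hypothetical.
a <- 1; b <- 1; n <- 10; y <- 7
S <- 10000
theta_s <- rbeta(S, a + y, b + n - y)            # theta^(s) ~ p(theta | y)
y_tilde <- rbinom(S, size = 1, prob = theta_s)   # y-tilde^(s) ~ p(y-tilde | theta^(s))
mean(y_tilde)                                    # approximates Pr(y-tilde = 1 | y)
```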
Estimating a probability from binomial data
A random variable y ∈ {0, 1, . . . , n} has a Binomial(n, θ) distribution if
p(y|θ) = dbinom(y|n, θ) = [n! / (y!(n − y)!)] θ^y (1 − θ)^(n−y)
for y = 0, 1, . . . , n.
An uncertain quantity θ, known to be between 0 and 1, has a Beta(a, b) distribution if
p(θ) = dbeta(θ|a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1)
for 0 < θ < 1.
Suppose θ ∼ Beta(a, b) and y|θ ∼ Binomial(n, θ). Having observed y,
p(θ|y) = p(θ)p(y|θ)/p(y)
= [1/p(y)] × [Γ(a + b)/(Γ(a)Γ(b))] θ^(a−1)(1 − θ)^(b−1) × [n!/(y!(n − y)!)] θ^y(1 − θ)^(n−y)
= c(a, b, n, y) θ^(a+y−1)(1 − θ)^(b+n−y−1)
= dbeta(θ|a + y, b + n − y)
The second to last line says that p(θ|y) is, as a function of θ, proportional to θ^(a+y−1)(1 − θ)^(b+n−y−1).
This means that it has the same shape as the Beta(a+y, b+n−y) density.
This means that it is the Beta(a + y, b + n − y) density. More succinctly,
p(θ|y) ∝ p(θ)p(y|θ) ∝ θ^(a−1)(1 − θ)^(b−1) θ^y(1 − θ)^(n−y)
= θ^(a+y−1)(1 − θ)^(b+n−y−1) ∝ dbeta(θ|a + y, b + n − y)
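As a sketch of what this update looks like in practice, the following R snippet (with made-up values a = 2, b = 2, n = 20, y = 14) plots a Beta(a, b) prior next to the Beta(a + y, b + n − y) posterior; dbeta here is base R's beta density, matching the dbeta(·|·) notation above.

```r
# Minimal sketch (hypothetical numbers): with a Beta(a, b) prior and
# y successes in n binomial trials, the posterior is Beta(a + y, b + n - y).
a <- 2; b <- 2; n <- 20; y <- 14
theta <- seq(0, 1, length.out = 500)
plot(theta, dbeta(theta, a + y, b + n - y), type = "l",
     xlab = expression(theta), ylab = "density")   # posterior
lines(theta, dbeta(theta, a, b), lty = 2)          # prior
legend("topleft", c("posterior", "prior"), lty = c(1, 2))
```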
Combining information
If θ|y ∼ Beta(a+y,b+n−y), then
E(θ|y) = (a + y)/(a + b + n) = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · (y/n)
For this model and prior distribution, the posterior expectation is a weighted average of the prior expectation and the sample average, with weights proportional to a + b and n, respectively.
The posterior variance is
var(θ|y) = (a + y)(b + n − y) / [(a + b + n)^2 (a + b + n + 1)] = E(θ|y)[1 − E(θ|y)] / (a + b + n + 1)
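A quick numerical check of the weighted-average form of the posterior mean, with made-up values of a, b, n and y:

```r
# Check (hypothetical a, b, n, y) that the posterior mean is a weighted
# average of the prior mean a/(a+b) and the sample proportion y/n.
a <- 2; b <- 2; n <- 20; y <- 14
w <- (a + b) / (a + b + n)               # weight on the prior mean
w * (a / (a + b)) + (1 - w) * (y / n)    # weighted average
(a + y) / (a + b + n)                    # posterior mean: same value
```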
The posterior predictive distribution
In the binomial example with the Beta(a, b) prior, we might be interested in the outcome of a new trial. Letting y denote the number of successes in n trials, and ỹ the result of an (n + 1)st trial,
Pr(ỹ = 1|y) = ∫_0^1 Pr(ỹ = 1|θ, y) p(θ|y) dθ
= ∫_0^1 θ p(θ|y) dθ = E(θ|y)
= (a + y)/(a + b + n) .
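A quick numerical check of this identity, with made-up values of a, b, n and y, using R's integrate:

```r
# Check (hypothetical a, b, n, y) that Pr(y-tilde = 1 | y) equals the
# posterior mean (a + y)/(a + b + n).
a <- 1; b <- 1; n <- 10; y <- 7
integrate(function(t) t * dbeta(t, a + y, b + n - y), 0, 1)$value
(a + y) / (a + b + n)
```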
Conjugate priors
The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy; the beta prior is a conjugate family for the binomial likelihood. The conjugate family is mathematically convenient, in that the posterior distribution follows a known parametric form.
In addition, conjugate prior distributions have the practical advantage of being interpretable as additional data.
For Y|θ ∼ Binomial(n, θ) where θ ∼ Beta(a, b),
E(θ|y) = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · (y/n)
and thus the Beta(a,b) prior can be interpreted as a ‘prior data set’ of a successes and b failures in a + b trials.
Binomial model for estimating a proportion
If y|θ ∼ Binomial(n, θ) then the likelihood function can be written, for any y = 0,1,...,n, as
p(y|θ) = [n! / (y!(n − y)!)] θ^y (1 − θ)^(n−y)
and the posterior satisfies
p(θ|y) ∝ p(θ)p(y|θ) ∝ p(θ) × θ^y(1 − θ)^(n−y) .
If the prior density is of the same form, then the posterior density will also be of this form; we parameterize such a prior density as
p(θ) ∝ θ^(a−1)(1 − θ)^(b−1)
which is the beta density with parameters a and b.
The Beta(a,b) prior distribution can be thought of as providing the equivalent information to a prior successes and b prior failures.
The posterior mean of θ is
E(θ|y) = (a + y)/(a + b + n) = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · (y/n)
and thus always lies between the prior mean a/(a + b) and the sample proportion y/n.
The posterior variance is
var(θ|y) = (a + y)(b + n − y) / [(a + b + n)^2 (a + b + n + 1)] = E(θ|y)[1 − E(θ|y)] / (a + b + n + 1) .
Poisson model for estimating a rate
Let y count the number of events in n units of exposure, and let θ represent the rate at which this event occurs.
Supposing that y ∼ Poisson(nθ), the likelihood function is
p(y|θ) = e^(−nθ)(nθ)^y / y!
for y = 0, 1, 2, . . ., and the posterior satisfies
p(θ|y) ∝ p(θ)p(y|θ) ∝ p(θ) × θ^y e^(−nθ) .
Thus the conjugate prior must be of the form
p(θ) ∝ θ^(a−1) e^(−bθ)
which is a gamma density with parameters a and b.
The conjugate prior for a Poisson likelihood is θ ∼ Gamma(a, b).
This prior distribution can be thought of as providing the equivalent information to a total count of a in b prior exposure units.
The posterior density satisfies
p(θ|y) ∝ p(θ)p(y|θ) ∝ θ^(a+y−1) e^(−(b+n)θ)
and thus
θ|y ∼ Gamma(a + y, b + n) .
The posterior mean of θ is
E(θ|y) = (a + y)/(b + n) = [b/(b + n)] · (a/b) + [n/(b + n)] · (y/n)
and thus always lies between the prior mean a/b and the observed rate y/n.
The posterior variance is
var(θ|y) = (a + y)/(b + n)^2 = E(θ|y)/(b + n) .
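A minimal R sketch of the gamma-Poisson update, with made-up prior parameters and data:

```r
# Minimal sketch (made-up numbers): Gamma(a, b) prior for a Poisson rate,
# updated with y events observed in n units of exposure.
a <- 2; b <- 1        # prior: roughly "a events in b prior exposure units"
y <- 30; n <- 12      # hypothetical data
a_post <- a + y; b_post <- b + n
a_post / b_post                                          # posterior mean
a_post / b_post^2                                        # posterior variance
qgamma(c(0.025, 0.975), shape = a_post, rate = b_post)   # 95% central interval
```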
Estimating the mean of a normal distribution
Let y_1, . . . , y_n|θ ∼ iid Normal(θ, σ^2) where σ^2 is known. A conjugate prior for θ is θ ∼ Normal(μ_0, τ_0^2).
The posterior density is
p(θ|y) ∝ p(θ)p(y|θ) ∝ exp{ −(1/2)[ (1/τ_0^2)(θ − μ_0)^2 + (1/σ^2) Σ_{i=1}^n (y_i − θ)^2 ] }
Algebraic simplification of this expression (left as an exercise) shows that
θ|y ∼ Normal(μ_n, τ_n^2)
where
μ_n = [ (1/τ_0^2) μ_0 + (n/σ^2) ȳ ] / [ (1/τ_0^2) + (n/σ^2) ]   and   1/τ_n^2 = 1/τ_0^2 + n/σ^2 .
The posterior mean μ_n always falls between the prior mean μ_0 and the sample mean ȳ.
The posterior precision equals the prior precision plus the data precision.
The Normal(μ_0, τ_0^2) prior distribution can be thought of as providing the information equivalent to σ^2/τ_0^2 observations with a mean value of μ_0.
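A minimal R sketch of this update, with a made-up data vector and assumed values of σ^2, μ_0 and τ_0^2:

```r
# Minimal sketch (hypothetical data): posterior for a normal mean theta with
# known variance sigma^2 and a Normal(mu0, tau0^2) prior.
sigma2 <- 4; mu0 <- 0; tau02 <- 100     # assumed known variance and prior
y <- c(1.8, 2.4, 3.1, 2.2, 2.9)         # made-up data
n <- length(y); ybar <- mean(y)
prec_n <- 1 / tau02 + n / sigma2        # posterior precision 1/tau_n^2
mu_n   <- (mu0 / tau02 + n * ybar / sigma2) / prec_n
tau_n2 <- 1 / prec_n
c(mu_n, sqrt(tau_n2))                   # posterior mean and SD
```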
Estimating the variance of a normal distribution
Let y_1, . . . , y_n|σ^2 ∼ iid Normal(θ, σ^2) where θ is known. The likelihood is
p(y|σ^2) ∝ σ^(−n) exp{ −(1/(2σ^2)) Σ_{i=1}^n (y_i − θ)^2 } = (σ^2)^(−n/2) exp{ −nV/(2σ^2) }
where
V = (1/n) Σ_{i=1}^n (y_i − θ)^2 .
The corresponding conjugate prior density is
p(σ^2) ∝ (σ^2)^(−(a+1)) e^(−b/σ^2) ,
that is, σ^2 ∼ Inverse-Gamma(a, b).
A convenient reparameterization is as a scaled inverse-χ^2 distribution with scale σ_0^2 and ν_0 degrees of freedom; that is,
σ^2 ∼ (ν_0 σ_0^2) / χ^2_{ν_0} .
We use the convenient but nonstandard notation
σ^2 ∼ Inv-χ^2(ν_0, σ_0^2) .
The resulting posterior density for σ^2 is
p(σ^2|y) ∝ p(σ^2)p(y|σ^2) ∝ (σ^2)^(−[(ν_0 + n)/2 + 1]) exp{ −(1/(2σ^2))(ν_0 σ_0^2 + nV) }
and thus
σ^2|y ∼ Inv-χ^2( ν_0 + n, (ν_0 σ_0^2 + nV)/(ν_0 + n) ) .
The posterior scale is a degrees-of-freedom-weighted average of
the prior and data scales.
The posterior degrees of freedom equal the sum of the prior and data degrees of freedom.
The Inv-χ^2(ν_0, σ_0^2) prior distribution can be thought of as providing the information equivalent to ν_0 observations with average squared deviation σ_0^2.
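A minimal R sketch, with made-up data and prior values, that draws from this posterior using the fact that σ^2 ∼ Inv-χ^2(ν, s^2) is equivalent to σ^2 = ν s^2 / X with X ∼ χ^2_ν:

```r
# Minimal sketch (hypothetical data): sampling sigma^2 from the
# Inv-chi^2(nu0 + n, (nu0*sigma0^2 + n*V)/(nu0 + n)) posterior.
theta <- 0                                  # known mean
y <- c(-1.2, 0.4, 2.1, -0.3, 1.5, 0.8)      # made-up data
nu0 <- 1; sigma02 <- 1                      # prior degrees of freedom and scale
n <- length(y); V <- mean((y - theta)^2)
nu_n <- nu0 + n
sigma_n2 <- (nu0 * sigma02 + n * V) / nu_n  # posterior scale
draws <- nu_n * sigma_n2 / rchisq(1e5, df = nu_n)   # draws of sigma^2 | y
quantile(draws, c(0.025, 0.5, 0.975))
```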