Noninformative prior distributions
Bayesian Statistics
Statistics 4224/5224, Spring 2021
January 21, 2021
Estimating the mean of a normal distribution
Let y₁, . . . , yₙ|θ ∼ iid Normal(θ, σ²) where σ² is known.

A conjugate prior for θ is θ ∼ Normal(μ₀, τ₀² = σ²/κ₀).
The resulting posterior is θ|y ∼ Normal(μₙ, τₙ²), where
$$\mu_n = \frac{\kappa_0\,\mu_0 + n\bar{y}}{\kappa_0 + n} \qquad \text{and} \qquad \tau_n^2 = \frac{\sigma^2}{\kappa_0 + n}\,.$$
The Normal(μ₀, τ₀² = σ²/κ₀) prior distribution can be thought of as providing the information equivalent to κ₀ observations with a mean value of μ₀.
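As a quick illustration, here is a minimal R sketch of this conjugate update; the data values and prior settings below are made up for illustration.

```r
# Conjugate update for a normal mean with known variance.
# Data and prior settings are hypothetical.
y      <- c(9.2, 10.1, 8.7, 11.3, 10.5)  # made-up observations
sigma2 <- 4                               # known sampling variance
mu0    <- 10                              # prior mean
kappa0 <- 2                               # prior "equivalent sample size"
n      <- length(y)

mu_n   <- (kappa0 * mu0 + n * mean(y)) / (kappa0 + n)  # posterior mean
tau2_n <- sigma2 / (kappa0 + n)                        # posterior variance
c(mu_n, tau2_n)
```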
Noninformative prior for normal mean
Continuing with the problem of estimating the mean of a normal distribution, using the normal conjugate prior.
If we let the prior precision 1/τ₀² → 0 (prior variance τ₀² → +∞), we obtain the posterior distribution
$$\theta \mid y \sim \text{Normal}\!\left(\bar{y},\, \frac{\sigma^2}{n}\right).$$
This is the posterior distribution that results from a prior distribution in which p(θ) is constant for all −∞ < θ < +∞.

But there is no such prior distribution, because that p(θ) integrates to +∞, and thus is not a probability density.
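A quick numerical check of this limit, reusing the hypothetical numbers from the earlier sketch: with κ₀ near zero, the conjugate posterior essentially coincides with Normal(ȳ, σ²/n).

```r
# As kappa0 -> 0, the conjugate posterior approaches Normal(ybar, sigma2/n).
y <- c(9.2, 10.1, 8.7, 11.3, 10.5); sigma2 <- 4; n <- length(y)
mu0 <- 10; kappa0 <- 1e-8                 # nearly-zero prior precision
mu_n   <- (kappa0 * mu0 + n * mean(y)) / (kappa0 + n)
tau2_n <- sigma2 / (kappa0 + n)
c(mu_n, mean(y))       # agree to many decimal places
c(tau2_n, sigma2 / n)  # agree to many decimal places
```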
Estimating the variance of a normal distribution
Let y₁, . . . , yₙ|σ² ∼ iid Normal(μ, σ²) where μ is known. The conjugate prior density satisfies
$$\frac{\nu_0\,\sigma_0^2}{\sigma^2} \sim \chi^2_{\nu_0}\,, \qquad \text{or} \qquad \sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)\,.$$
The resulting posterior distribution is
$$\sigma^2 \mid y \sim \text{Inv-}\chi^2\!\left(\nu_0 + n,\; \frac{\nu_0\,\sigma_0^2 + nV}{\nu_0 + n}\right)$$
where $V = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2$.
The Inv-χ²(ν₀, σ₀²) prior distribution can be thought of as providing the information equivalent to ν₀ observations with average squared deviation σ₀².
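A minimal R sketch of this update, again with hypothetical data; it also shows one standard way to draw from a scaled inverse-χ² distribution, namely νₙσₙ²/χ²(νₙ).

```r
# Conjugate update for a normal variance with known mean.
# Data and prior settings are hypothetical.
y   <- c(9.2, 10.1, 8.7, 11.3, 10.5)
mu  <- 10                                  # known mean
n   <- length(y)
V   <- mean((y - mu)^2)                    # average squared deviation
nu0 <- 2; sigma02 <- 3                     # prior settings
nu_n     <- nu0 + n
sigma2_n <- (nu0 * sigma02 + n * V) / nu_n
# Draws from Inv-chi^2(nu_n, sigma2_n): nu_n * sigma2_n / chi^2 draws
draws <- nu_n * sigma2_n / rchisq(5000, df = nu_n)
mean(draws)
```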
Noninformative prior for normal variance
Continuing with the problem of estimating the variance of a normal distribution.
If we let the prior degrees of freedom ν₀ → 0, we obtain the posterior
$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(n, V)\,.$$
This is the posterior distribution that results from a prior density that satisfies p(σ²) ∝ 1/σ².

But there is no such prior distribution, because that p(σ²) integrates to +∞, and thus is not a probability density.
Improper priors
The prior distributions p(θ) ∝ 1 for a normal mean, and p(σ²) ∝ 1/σ² for a normal variance, are examples of improper prior distributions.
A prior distribution with density p(θ) ∝ p̃(θ) is said to be a proper prior if
$$\int \tilde{p}(\theta)\, d\theta < \infty\,.$$

A prior distribution with density p(θ) ∝ p̃(θ), where the integral of p̃ does not converge, is said to be improper.
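One can probe propriety numerically with R's integrate(), though this is only a heuristic check, not a proof; the two kernels below are illustrative.

```r
# Heuristic numerical check of whether an unnormalized density integrates.
integrate(function(t) exp(-t^2 / 2), -Inf, Inf)  # proper: normal kernel, ~2.5066
try(integrate(function(s2) 1 / s2, 0, Inf))      # improper: 1/sigma^2 diverges
```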
Noninformative priors
In problems where genuine prior information is difficult to come by, an analyst may wish to use a prior distribution that reflects this lack of prior information. Such a prior distribution may be described as vague, flat, diffuse, or noninformative.
The prior density p(σ²) ∝ 1/σ² for a normal variance can be called noninformative by this definition, since a change of variables to log σ shows it can equivalently be expressed as p(log σ) ∝ 1, that is, flat on the log scale.
Two comments on noninformative prior distributions
1. They are generally not uniquely defined. For example, if θ is a success probability to be estimated based on n Bernoulli trials, the uniform distribution, or Beta(1,1), would seem like a good choice for a noninformative prior. Recall however that the Beta(a,b) effectively contributes a additional successes and b additional failures to the data, so wouldn’t Beta(.01,.01) or Beta(.001,.001) be even more noninformative? However, those densities have a bathtub shape, not a flat one, and assign the bulk of the prior probability for θ very close to 0 and very close to 1. Would such a probability distribution reasonably reflect our true prior belief about θ?
In practice, as long as y and n − y are both reasonably large, the prior’s contribution to the Beta(a+y,b+n−y) posterior is not too substantial, and the precise choice of a and b has little practical consequence.
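A small R sketch of both points: the bathtub shape of Beta(.01,.01) versus the flat Beta(1,1), and the nearly identical posteriors they produce; the counts y = 12 and n = 20 are hypothetical.

```r
# Compare two candidate "noninformative" beta priors and their posteriors
# for hypothetical data: y = 12 successes in n = 20 trials.
y <- 12; n <- 20
curve(dbeta(x, 1, 1), 0, 1, ylim = c(0, 6), ylab = "density")  # flat prior
curve(dbeta(x, .01, .01), add = TRUE, lty = 2)                 # bathtub prior
curve(dbeta(x, 1 + y, 1 + n - y), add = TRUE, col = "blue")    # posterior, flat prior
curve(dbeta(x, .01 + y, .01 + n - y), add = TRUE,
      col = "red", lty = 2)                                    # nearly the same
```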
2. Despite the ubiquity of the terminology (which we will continue to use in our course despite this), there is no such thing as a noninformative prior. If we want to define “noninformative” to indicate a flat prior density, then we must recognize that a noninformative prior under one parameterization will necessarily imply a non-uniform, and hence informative, prior in another parameterization.
For example, let θ denote a success probability and φ = θ² the probability of two consecutive successes.

A Uniform(0,1) prior on θ implies, by the change-of-variables formula, p(φ) = p(θ)|dθ/dφ| = 1/(2√φ); so φ ∼ Beta(1/2, 1), with prior expectation E(φ) = 1/3, thus not a uniform prior at all.
A Uniform(0, 1) prior on φ implies p(θ) = 2θ so θ ∼ Beta(2, 1), with prior expectation E(θ) = 2/3.
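A quick Monte Carlo check of this change of variables, just a sketch:

```r
# theta ~ Uniform(0,1) and phi = theta^2 should match Beta(1/2, 1),
# with E(phi) = 1/3.
set.seed(1)
theta <- runif(1e5)
phi   <- theta^2
mean(phi)                  # approximately 1/3
mean(rbeta(1e5, 1/2, 1))   # also approximately 1/3
```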
A caveat on the use of improper priors
If the prior distribution is proper then the posterior will be proper. That is, if p(y|θ) is a likelihood, and $\int p(\theta)\,d\theta < \infty$, then
$$\int p(\theta)\,p(y\mid\theta)\,d\theta < \infty\,.$$
If p(θ) is an improper prior density, then the posterior defined by
$$p(\theta \mid y) \propto p(\theta)\,p(y\mid\theta)$$
may be proper, or it may not.
Thus it is crucial, when using improper prior densities, to check that the resulting posterior is proper.
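As an illustration of such a check done numerically (again a heuristic, not a proof): under the flat improper prior p(θ) ∝ 1 with a Normal(θ, 1) likelihood for a single hypothetical observation, the unnormalized posterior integrates to a finite value.

```r
# Flat improper prior p(theta) = 1 times a Normal(theta, 1) likelihood
# for one hypothetical observation y = 2; the integral is finite (= 1),
# so the posterior is proper in this case.
y <- 2
post_kernel <- function(theta) 1 * dnorm(y, mean = theta, sd = 1)
integrate(post_kernel, -Inf, Inf)
```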
Even more unsettling, later in this course we will study methods for simulating random draws from a complex posterior distribution, methods that will “work” even if the posterior is improper.
This is not a good thing!
You won’t get an error message; you’ll just get results that are complete nonsense, and you won’t even know it.
Only use improper priors if you (or someone you trust) can verify mathematically that the resulting posterior is proper.
Homework assignment 1
Homework 1 is nominally due before class on Tue, Jan 26. Courseworks will accept submissions through end of day on Wed, Jan 27, after which no late papers will be accepted.
• Problem 1 is a probability exercise — there is no computing required.
• On problem 2, I suggest you use a discrete approximation. That is, while the correlation coefficient can take any value 0 < ρ < 1, I suggest you only compute the posterior p(ρ|x, y) for values of ρ ∈ {.001, .002, . . . , .999}. There is no need to “solve” for the normalizing constant; you can approximate it numerically, as in the sketch below.
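Here is a sketch of the grid-approximation mechanics only; the kernel below is a made-up stand-in, not problem 2's actual likelihood.

```r
# Discrete approximation on a grid: evaluate an unnormalized posterior
# and normalize by its sum. The kernel here is a hypothetical stand-in.
rho    <- seq(.001, .999, by = .001)
unnorm <- rho^3 * (1 - rho)^2        # hypothetical unnormalized posterior
post   <- unnorm / sum(unnorm)       # normalizing constant handled numerically
sum(post * rho)                      # e.g., posterior mean of rho
```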
• Problem 3: Exact answers are available for part (a), and you can use the R function ‘qbeta’ for part (b). For part (c) use calculus (a numerical sanity check is sketched below) to solve:
$$\Pr(\tilde{y} > 0 \mid y) = \int \Pr(\tilde{y} > 0 \mid \theta)\,p(\theta \mid y)\,d\theta\,.$$
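Though part (c) asks for calculus, the same integral is easy to check numerically; the posterior density and predictive probability below are hypothetical stand-ins, not the ones from problem 3.

```r
# Numerical version of Pr(ytilde > 0 | y) = integral of
# Pr(ytilde > 0 | theta) * p(theta | y) dtheta, with stand-in densities.
post_dens <- function(th) dnorm(th, mean = 1, sd = 0.5)    # stand-in p(theta|y)
pr_pos    <- function(th) 1 - pnorm(0, mean = th, sd = 1)  # stand-in Pr(ytilde>0|theta)
integrate(function(th) pr_pos(th) * post_dens(th), -Inf, Inf)
```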
• Problem 4: In part (b), it is fine (preferred in fact) to express your answer in terms of the beta density function
$$\text{dbeta}(y\mid a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, y^{a-1}(1-y)^{b-1} \qquad \text{for } 0 < y < 1\,,$$
just as you will deploy the R function ‘dbeta’ for the graphical and computational parts of the problem.
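For instance, the graphical part might look something like the following, where a and b are hypothetical placeholders for your posterior's parameters.

```r
# Plot a Beta(a, b) density with dbeta(); a and b are placeholders.
a <- 3; b <- 5
curve(dbeta(x, a, b), from = 0, to = 1,
      xlab = expression(theta), ylab = "density")
```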
• On problem 5, use the Poisson distribution. So if Y = number of times the word ‘upon’ is used in a 1000-word text, then Y is distributed as Poisson(3.24) if the text was written by Hamilton and Poisson(0.23) if it was written by Madison. This problem is a straightforward application of Bayes’ rule; a sketch follows.
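In this sketch of the Bayes' rule computation, the rates 3.24 and 0.23 come from the problem as stated above, but the 50/50 prior and the observed count are illustrative assumptions only.

```r
# Posterior probability of authorship given an observed count of 'upon'.
# The equal prior odds and y = 2 are illustrative assumptions.
y     <- 2
prior <- c(hamilton = 0.5, madison = 0.5)
lik   <- c(hamilton = dpois(y, 3.24), madison = dpois(y, 0.23))
post  <- prior * lik / sum(prior * lik)
post
```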
• Problem 6: Let Yᵢ = number of clutch free throws made by the ith player, so Yᵢ ∼ indep Binomial(nᵢ, θᵢ), where nᵢ is the number of clutch attempts and θᵢ is the ith player’s true clutch free throw probability, for i = 1, . . . , 10. Assign θ = (θ₁, . . . , θ₁₀) a uniform prior. In part (b), arrange the 10 plots as a 2×5 matrix in a single display, and clearly label each plot. For part (c) make a table of posterior medians as well as equal-tailed 50% and 95% posterior intervals for each θᵢ; use the R function ‘qbeta’ with p=c(.025, .25, .5, .75, .975), as sketched below.
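A sketch of the part (c) computation: with the uniform prior, θᵢ|yᵢ ∼ Beta(1 + yᵢ, 1 + nᵢ − yᵢ), so the table is a few qbeta calls. The y and n vectors below are made up; use the data from the assignment.

```r
# Posterior quantiles for each theta_i under a uniform prior.
# The made and attempted counts here are hypothetical.
y <- c(3, 5, 2, 8, 4, 6, 1, 7, 5, 3)     # clutch free throws made
n <- c(5, 8, 4, 10, 6, 9, 3, 10, 7, 5)   # clutch attempts
p <- c(.025, .25, .5, .75, .975)
tab <- t(sapply(1:10, function(i) qbeta(p, 1 + y[i], 1 + n[i] - y[i])))
dimnames(tab) <- list(paste("player", 1:10), paste0(100 * p, "%"))
round(tab, 3)
```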
The two-parameter normal model
Building on the results we developed for estimation of the mean and variance separately, assuming the other parameter was known, we now consider the model
y₁, . . . , yₙ|μ, σ² ∼ iid Normal(μ, σ²)

and suppose we wish to do Bayesian inference about the two-dimensional model parameter θ = (μ, σ²). The likelihood for this model can be written
$$p(y\mid\mu,\sigma^2) \propto (\sigma^2)^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right\} = (\sigma^2)^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right]\right\}$$
where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ is the sample variance of the yᵢ’s.
Conjugate prior
The conjugate prior is given by
$$\mu \mid \sigma^2 \sim \text{Normal}(\mu_0,\, \sigma^2/\kappa_0)$$
$$\sigma^2 \sim \text{Inv-}\chi^2(\nu_0,\, \sigma_0^2)\,.$$
The resulting posterior is given by
$$\mu \mid \sigma^2, y \sim \text{Normal}(\mu_n,\, \sigma^2/\kappa_n)$$
$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(\nu_n,\, \sigma_n^2)$$
where
$$\kappa_n = \kappa_0 + n\,, \qquad \mu_n = \frac{\kappa_0\,\mu_0 + n\bar{y}}{\kappa_n}\,, \qquad \nu_n = \nu_0 + n\,,$$
and
$$\sigma_n^2 = \frac{1}{\nu_n}\left[\nu_0\,\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0\, n}{\kappa_n}(\bar{y}-\mu_0)^2\right].$$
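A minimal R sketch of these four updates, with hypothetical data and prior settings:

```r
# Two-parameter normal model: conjugate posterior parameters.
# Data and prior settings are hypothetical.
y   <- c(9.2, 10.1, 8.7, 11.3, 10.5)
n   <- length(y); ybar <- mean(y); s2 <- var(y)
mu0 <- 10; kappa0 <- 2; nu0 <- 2; sigma02 <- 3
kappa_n  <- kappa0 + n
mu_n     <- (kappa0 * mu0 + n * ybar) / kappa_n
nu_n     <- nu0 + n
sigma2_n <- (nu0 * sigma02 + (n - 1) * s2 +
             (kappa0 * n / kappa_n) * (ybar - mu0)^2) / nu_n
c(kappa_n = kappa_n, mu_n = mu_n, nu_n = nu_n, sigma2_n = sigma2_n)
```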
Noninformative prior
A sensible vague prior density for μ and σ is uniform on (μ, log σ) or, equivalently,
$$p(\mu, \sigma^2) \propto (\sigma^2)^{-1}\,.$$
The resulting posterior can be specified by
$$\mu \mid \sigma^2, y \sim \text{Normal}(\bar{y},\, \sigma^2/n)$$
and
$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1,\, s^2)\,.$$
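Simulating from this posterior is straightforward by composition: draw σ² from its marginal, then μ given σ². A sketch with hypothetical data:

```r
# Posterior simulation under p(mu, sigma2) proportional to 1/sigma2.
# Data are hypothetical.
y <- c(9.2, 10.1, 8.7, 11.3, 10.5)
n <- length(y); ybar <- mean(y); s2 <- var(y)
m      <- 5000
sigma2 <- (n - 1) * s2 / rchisq(m, df = n - 1)    # Inv-chi^2(n-1, s2) draws
mu     <- rnorm(m, mean = ybar, sd = sqrt(sigma2 / n))
quantile(mu, c(.025, .5, .975))                   # posterior summary for mu
```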