The Normal Model
Bayesian Statistics, Statistics 4224/5224, Spring 2021
January 26, 2021
Estimating the mean of a normal distribution
Let $y_1, \ldots, y_n \mid \theta \overset{\text{iid}}{\sim} \text{Normal}(\theta, \sigma^2)$, where $\sigma^2$ is known.
A conjugate prior for $\theta$ is $\theta \sim \text{Normal}(\mu_0, \tau_0^2 = \sigma^2/\kappa_0)$.
The resulting posterior is $\theta \mid y \sim \text{Normal}(\mu_n, \tau_n^2)$, where
\[
\mu_n = \frac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_0 + n} \qquad \text{and} \qquad \tau_n^2 = \frac{\sigma^2}{\kappa_0 + n}.
\]
The $\text{Normal}(\mu_0, \tau_0^2 = \sigma^2/\kappa_0)$ prior distribution can be thought of as providing the information equivalent to $\kappa_0$ observations with a mean value of $\mu_0$.
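As a quick illustration, here is a minimal sketch of this update in Python; the data and prior settings below are made up for illustration.

```python
import numpy as np

# A minimal sketch of the conjugate update for a normal mean with
# known variance; the data and prior settings are made up.
sigma2 = 4.0                     # known sampling variance sigma^2
kappa0, mu0 = 5.0, 0.0           # prior: theta ~ Normal(mu0, sigma2 / kappa0)
y = np.array([1.2, 0.8, 2.1, 1.5, 0.3, 1.9])
n, ybar = len(y), y.mean()

mu_n = (kappa0 * mu0 + n * ybar) / (kappa0 + n)   # posterior mean
tau2_n = sigma2 / (kappa0 + n)                    # posterior variance

print(f"theta | y ~ Normal({mu_n:.3f}, {tau2_n:.3f})")
```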
Noninformative prior for normal mean
We continue with the problem of estimating the mean of a normal distribution, using the normal conjugate prior.
If we let the prior precision $1/\tau_0^2 \to 0$ (prior variance $\tau_0^2 \to +\infty$), we obtain the posterior distribution
\[
\theta \mid y \sim \text{Normal}\!\left(\bar{y}, \frac{\sigma^2}{n}\right).
\]
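To see this, note that $\tau_0^2 = \sigma^2/\kappa_0$, so letting $1/\tau_0^2 \to 0$ is the same as letting $\kappa_0 \to 0$; in that limit the conjugate formulas above give
\[
\mu_n = \frac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_0 + n} \to \bar{y} \qquad \text{and} \qquad \tau_n^2 = \frac{\sigma^2}{\kappa_0 + n} \to \frac{\sigma^2}{n}.
\]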
This is the posterior distribution that results from a prior distribution in which $p(\theta)$ is constant for all $-\infty < \theta < +\infty$.
But there is no such prior distribution, because that $p(\theta)$ integrates to $+\infty$, and thus is not a probability density.
Estimating the variance of a normal distribution
Let $y_1, \ldots, y_n \mid \sigma^2 \overset{\text{iid}}{\sim} \text{Normal}(\mu, \sigma^2)$ where $\mu$ is known. The conjugate prior density satisfies
\[
\frac{\nu_0 \sigma_0^2}{\sigma^2} \sim \chi^2_{\nu_0}, \qquad \text{or} \qquad \sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2).
\]
The resulting posterior distribution is
\[
\sigma^2 \mid y \sim \text{Inv-}\chi^2\!\left(\nu_0 + n, \; \frac{\nu_0 \sigma_0^2 + nV}{\nu_0 + n}\right)
\]
where $V = \frac{1}{n}\sum_{i=1}^n (y_i - \mu)^2$.
The $\text{Inv-}\chi^2(\nu_0, \sigma_0^2)$ prior distribution can be thought of as providing the information equivalent to $\nu_0$ observations with average squared deviation $\sigma_0^2$.
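A minimal sketch of this update in Python, with made-up data and hyperparameters, using the fact that $X \sim \chi^2_\nu$ implies $\nu s^2/X \sim \text{Inv-}\chi^2(\nu, s^2)$:

```python
import numpy as np
from scipy import stats

# A minimal sketch of the conjugate update for a normal variance with
# known mean; data and hyperparameters are made up.
mu = 0.0                          # known mean
nu0, sigma02 = 4.0, 1.0           # prior: sigma^2 ~ Inv-chi^2(nu0, sigma02)
y = np.array([0.5, -1.2, 0.8, 2.0, -0.4, 1.1])
n = len(y)
V = np.mean((y - mu) ** 2)        # average squared deviation about mu

nu_n = nu0 + n
sigma2_n = (nu0 * sigma02 + n * V) / nu_n

# Draw from Inv-chi^2(nu_n, sigma2_n): if X ~ chi^2_nu then
# nu * s2 / X ~ Inv-chi^2(nu, s2).
draws = nu_n * sigma2_n / stats.chi2.rvs(df=nu_n, size=10_000)
print(f"posterior scale {sigma2_n:.3f}; mean of posterior draws {draws.mean():.3f}")
```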
Noninformative prior for normal variance
We continue with the problem of estimating the variance of a normal distribution.
If we let the prior degrees of freedom $\nu_0 \to 0$, we obtain the posterior
\[
\sigma^2 \mid y \sim \text{Inv-}\chi^2(n, V).
\]
This is the posterior distribution that results from a prior density that satisfies $p(\sigma^2) \propto 1/\sigma^2$.
But there is no such prior distribution, because that $p(\sigma^2)$ integrates to $+\infty$, and thus is not a probability density.
Improper priors
The prior distributions $p(\theta) \propto 1$ for a normal mean, and $p(\sigma^2) \propto 1/\sigma^2$ for a normal variance, are examples of improper prior distributions.
A prior distribution with density $p(\theta) \propto \tilde{p}(\theta)$ is said to be a proper prior if
\[
\int \tilde{p}(\theta)\,d\theta < \infty.
\]
A prior distribution with density $p(\theta) \propto \tilde{p}(\theta)$, where the integral of $\tilde{p}$ does not converge, is said to be improper.
Noninformative priors
In problems where genuine prior information is difficult to come by, an analyst may wish to use a prior distribution that reflects this lack of prior information.
Such a prior distribution may be described as vague, flat, diffuse, or noninformative.
The prior density $p(\sigma^2) \propto 1/\sigma^2$ for a normal variance can be called noninformative in this sense, since it is equivalent to a flat prior on the log scale: $p(\log \sigma) \propto 1$.
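The equivalence is just a change of variables: with $\phi = \log \sigma$, so that $\sigma^2 = e^{2\phi}$ and $\phi = \frac{1}{2}\log \sigma^2$,
\[
p(\sigma^2) = p(\phi)\left|\frac{d\phi}{d\sigma^2}\right| \propto 1 \cdot \frac{1}{2\sigma^2} \propto \frac{1}{\sigma^2}.
\]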
Two comments on noninformative prior distributions
1. They are generally not uniquely defined. For example, if $\theta$ is a success probability to be estimated based on $n$ Bernoulli trials, the uniform distribution, or Beta(1, 1), would seem like a good choice for a noninformative prior. Recall however that the Beta$(a, b)$ prior effectively contributes $a$ additional successes and $b$ additional failures to the data, so wouldn't Beta(.01, .01) or Beta(.001, .001) be even more noninformative? However, those densities have a bathtub shape, not a flat one, and assign the bulk of the prior probability for $\theta$ very close to 0 and very close to 1. Would such a probability distribution reasonably reflect our true prior belief about $\theta$?
In practice, as long as $y$ and $n - y$ are both reasonably large, the prior's contribution to the Beta$(a + y,\, b + n - y)$ posterior is not too substantial, and the precise choice of $a$ and $b$ has little practical consequence, as the sketch below illustrates.
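A quick numerical illustration of this point; the counts $y$ and $n$ here are made up for demonstration.

```python
from scipy import stats

# An illustration, with made-up counts, that the precise choice among
# "noninformative" Beta priors barely matters when y and n - y are large.
y, n = 40, 100
for a, b in [(1, 1), (0.01, 0.01), (0.001, 0.001)]:
    post = stats.beta(a + y, b + n - y)
    lo, hi = post.ppf([0.025, 0.975])
    print(f"Beta({a}, {b}) prior: posterior mean {post.mean():.4f}, "
          f"95% interval ({lo:.4f}, {hi:.4f})")
```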
2. Despite the ubiquity of the terminology (which we will continue to use in this course), there is no such thing as a noninformative prior. If we want to define 'noninformative' to indicate a flat prior density, then we must recognize that a noninformative prior under one parameterization will necessarily imply a non-uniform, and hence informative, prior under another parameterization.
For example, let $\theta$ denote a success probability and $\phi = \theta^2$ the probability of two consecutive successes.
A Uniform(0, 1) prior on $\theta$ implies $p(\phi) = 1/(2\sqrt{\phi})$, so $\phi \sim \text{Beta}(1/2, 1)$ with prior expectation $E(\phi) = 1/3$: not a uniform prior at all.
A Uniform(0, 1) prior on $\phi$ implies $p(\theta) = 2\theta$, so $\theta \sim \text{Beta}(2, 1)$, with prior expectation $E(\theta) = 2/3$.
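A small Monte Carlo check of both calculations; the simulation size and seed are arbitrary.

```python
import numpy as np

# Monte Carlo check of both calculations: theta ~ Uniform(0,1) makes
# phi = theta^2 a Beta(1/2, 1) draw with E(phi) = 1/3, while
# phi ~ Uniform(0,1) makes theta = sqrt(phi) a Beta(2, 1) draw with
# E(theta) = 2/3.
rng = np.random.default_rng(0)
theta = rng.uniform(size=100_000)
print(f"E(theta^2)   = {np.mean(theta ** 2):.4f}   (theory: 1/3)")
phi = rng.uniform(size=100_000)
print(f"E(sqrt(phi)) = {np.mean(np.sqrt(phi)):.4f}   (theory: 2/3)")
```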
A caveat on the use of improper priors
If the prior distribution is proper then the posterior will be proper. That is, if $p(y \mid \theta)$ is a likelihood and $\int p(\theta)\,d\theta < \infty$, then
\[
\int p(\theta)\,p(y \mid \theta)\,d\theta < \infty.
\]
If $p(\theta)$ is an improper prior density, then the posterior defined by $p(\theta \mid y) \propto p(\theta)\,p(y \mid \theta)$ may be proper, or it may not be.
Thus it is crucial, when using improper prior densities, to check that the resulting posterior is proper.
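A standard example: combining the improper prior $p(\theta) \propto \theta^{-1}(1-\theta)^{-1}$ (the limit of the Beta$(a, b)$ priors discussed earlier as $a, b \to 0$) with a Binomial$(n, \theta)$ likelihood gives
\[
p(\theta \mid y) \propto \theta^{y-1}(1-\theta)^{n-y-1},
\]
a proper Beta$(y, n-y)$ posterior when $0 < y < n$, but an improper one when $y = 0$ or $y = n$.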
Even more unsettling, later in this course we will study methods for simulating random draws from a complex posterior distribution, methods that will "work" even if the posterior is improper.
This is not a good thing!
You won't get an error message; you'll just get results that are complete nonsense, and you won't even know it.
Only use improper priors if you (or someone you trust) can verify mathematically that the resulting posterior is proper.
The two-parameter normal model
Building on the results we developed for estimation of the mean and variance separately, assuming the other parameter was known, we now consider the model
\[
y_1, \ldots, y_n \mid \mu, \sigma^2 \overset{\text{iid}}{\sim} \text{Normal}(\mu, \sigma^2)
\]
and suppose we wish to do Bayesian inference about the two-dimensional model parameter $\theta = (\mu, \sigma^2)$. The likelihood for this model can be written
\[
p(y \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right)
= (\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right)
\]
where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2$ is the sample variance of the $y_i$'s.
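The second expression follows from the usual sum-of-squares decomposition: since $\sum_{i=1}^n (y_i - \bar{y}) = 0$, the cross term vanishes in
\[
\sum_{i=1}^n (y_i - \mu)^2 = \sum_{i=1}^n \left[(y_i - \bar{y}) + (\bar{y} - \mu)\right]^2 = (n-1)s^2 + n(\bar{y} - \mu)^2.
\]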
Conjugate prior
The conjugate prior is given by
\[
\mu \mid \sigma^2 \sim \text{Normal}(\mu_0, \sigma^2/\kappa_0), \qquad \sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2).
\]
The resulting posterior is given by
\[
\mu \mid \sigma^2, y \sim \text{Normal}(\mu_n, \sigma^2/\kappa_n), \qquad \sigma^2 \mid y \sim \text{Inv-}\chi^2(\nu_n, \sigma_n^2)
\]
where
\[
\kappa_n = \kappa_0 + n, \qquad \mu_n = \frac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n}, \qquad \nu_n = \nu_0 + n,
\]
\[
\sigma_n^2 = \frac{1}{\nu_n}\left[\nu_0 \sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_n}(\bar{y} - \mu_0)^2\right].
\]
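A minimal sketch of this update, and of sampling from the joint posterior, in Python; the data and hyperparameters are made up for illustration.

```python
import numpy as np
from scipy import stats

# A minimal sketch of the two-parameter conjugate update and of
# sampling from the joint posterior; data and hyperparameters are
# made up for illustration.
mu0, kappa0, nu0, sigma02 = 0.0, 2.0, 2.0, 1.0
y = np.array([1.1, 0.4, 2.3, 1.8, -0.2, 1.0, 0.7])
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
nu_n = nu0 + n
sigma2_n = (nu0 * sigma02 + (n - 1) * s2
            + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n

# Draw sigma^2 | y ~ Inv-chi^2(nu_n, sigma2_n), then mu | sigma^2, y.
rng = np.random.default_rng(1)
sig2 = nu_n * sigma2_n / stats.chi2.rvs(df=nu_n, size=5_000, random_state=rng)
mu = rng.normal(mu_n, np.sqrt(sig2 / kappa_n))
print(f"posterior means: mu {mu.mean():.3f}, sigma^2 {sig2.mean():.3f}")
```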
Noninformative prior
A sensible vague prior density for $\mu$ and $\sigma$ is uniform on $(\mu, \log \sigma)$ or, equivalently,
\[
p(\mu, \sigma^2) \propto (\sigma^2)^{-1}.
\]
The resulting posterior is given by
\[
\mu \mid \sigma^2, y \sim \text{Normal}(\bar{y}, \sigma^2/n), \qquad \sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1, s^2).
\]
The marginal posterior of $\mu$, integrating over $\sigma^2$, is given by
\[
\left.\frac{\mu - \bar{y}}{s/\sqrt{n}}\,\right|\, y \sim t_{n-1}.
\]
That is, the marginal posterior of $\mu$ is Student's $t$-distribution with location $\bar{y}$, scale $s/\sqrt{n}$, and $n - 1$ degrees of freedom.
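A simulation sketch of this posterior, drawing $\sigma^2$ and then $\mu \mid \sigma^2$ and checking the $t_{n-1}$ marginal; the data here are simulated toy values.

```python
import numpy as np
from scipy import stats

# Simulation from the posterior under p(mu, sigma^2) proportional to
# 1/sigma^2, checking mu's marginal against the t_{n-1} result.
# The data here are simulated toy values.
rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=2.0, size=20)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

sig2 = (n - 1) * s2 / stats.chi2.rvs(df=n - 1, size=50_000, random_state=rng)
mu = rng.normal(ybar, np.sqrt(sig2 / n))

t_draws = (mu - ybar) / np.sqrt(s2 / n)   # should look like t_{n-1}
print(f"simulated 97.5th percentile: {np.quantile(t_draws, 0.975):.3f}")
print(f"t({n - 1}) 97.5th percentile:    {stats.t.ppf(0.975, df=n - 1):.3f}")
```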