ECON6300/7320/8300 Advanced Microeconometrics Bayesian Methods
Christiern Rose 1University of Queensland
Lecture 9
1/42
A Thought Experiment:
Experiment: select a person randomly on St. Lucia campus.
Event A := a randomly selected person is taller than 190cm. Then,
Pr(A) =?
Event B := a randomly selected person is female.
A female is selected. Then, what are the odds that she is 190cm or taller?
Pr(A|B) =?
2/42
Subjective Probability
Each of one us can say what we believe about Pr(A), although it might be wrong
When we learn that a female is selected, we may revise our belief about the person being taller than 190cm. Mathematically, we write it Pr(A|B)
Your belief before the additional information, Pr(A), is called the prior.
Your (revised) belief after the information, Pr(A|B), is the posterior.
3/42
Bayes Theorem
Recall that
Pr(A|B) = Pr(A ∩ B) and similarly Pr(B|A) = Pr(A ∩ B)
Pr(B) Pr(A) So, Pr(A ∩ B) = Pr(A) Pr(B|A).
Pr(A|B) = Pr(A) Pr(B|A) Pr(B)
This is a version of Bayes theorem.
Bayes theorem shows how the prior is updated to the posterior when there is additional information.
4/42
Bayes Theorem for PDF
Reprint Bayes theorem:
Pr(A|B) = Pr(A) Pr(B|A) .
Pr(B)
Let π(x) be the marginal PDF of X.
Let p(y) be the marginal PDF of Y.
Let f(y|x) be the conditional PDF of Y given X = x.
By Bayes theorem, we write
π(x|y) = π(x)f(y|x) p(y)
π(x) is the prior on X and π(x|y) is the posterior on X given Y = y
5/42
Bayes Theorem for PDF
If we use θ (the parameter) instead of x: π(θ|y) = π(θ)f(y|θ).
p(y)
π(θ) is the prior on θ and π(θ|y) is the posterior on θ given
Y = y.
The parameter (θ) may not be random. But, it is uncertain.
Our subjective belief about θ is represented as π(θ) (prior belief, before data) and π(θ|y) (posterior belief, after data).
6/42
Bernouilli Example
AcoinistossedandY =1iftheoutcomeisHeadand otherwise Y = 0.
Let θ := Pr(Y = 1), i.e., Y ∼ Bernouilli(θ). Then, f(y|θ)=θy(1−θ)1−y1(y ∈{0,1})
Suppose you believe every θ ∈ (0, 1) is equally likely. Then, the prior is
θ ∼ Uniform(0, 1), i.e., π(θ) = 1(θ ∈ (0, 1))
Then, the posterior is
π(θ|y)=1(θ∈(0,1))·θy(1−θ)1−y /p(y)
If y = 1,
π(θ) f(y|θ)
π(θ|y =1)=1(θ∈(0,1))·θ/p(1) = 2θ · 1(θ ∈ (0, 1))
What is π(θ|y = 0)?
7/42
Bernouilli Example
To be more flexible, we consider the beta prior, i.e., π(θ) = Γ(α + β) · θα−1(1 − θ)β−1 · 1(θ ∈ (0, 1))
Γ(α)Γ(β)
Since θ ∼ Beta(α, β), we know that
E[θ]= α α+β
V[θ] = αβ
(α + β)2(α + β + 1)
Suppose you believe that θ = 0.25 on average, i.e., E[θ] = 1/4, and your belief is quite strong, e.g.,
V [θ] = 0.01. Then, (α, β) ≈ (4, 13).
As before, let’s compute π(θ|y = 1).
8/42
Bernouilli Example
To solve this problem, we need to learn a useful trick: π(θ) = Γ(α + β) · θα−1(1 − θ)β−1 · 1(θ ∈ (0, 1))
Γ(α)Γ(β)
= c · θα−1(1 − θ)β−1 · 1(θ ∈ (0, 1))
where c is the normalising constant (i.e. it does not depend on θ).
Once we know
π(θ) = c · θα−1(1 − θ)β−1 · 1(θ ∈ (0, 1)),
we also know c = Γ(α+β) (Why?). Γ(α)Γ(β)
So, even if we simply write
π(θ) ∝ θα−1(1 − θ)β−1 · 1(θ ∈ (0, 1)),
we immediately know
θ ∼ Beta(α, β).
9/42
Bernouilli Example
Bayes theorem:
π(θ|y) = π(θ)f(y|θ)/p(y) ∝ π(θ)f(y|θ)
∝ θα−1(1 − θ)β−1 · 1(θ ∈ (0, 1)) · θy (1 − θ)1−y
∝π(θ) f(y|θ) = θ(α+y)−1(1 − θ)(β+1−y)−1 · 1(θ ∈ (0, 1))
= θα ̃−1(1 − θ)β ̃−1 · 1(θ ∈ (0, 1)) where α ̃ = α + y and β ̃ = β + 1 − y.
The posterior density is
π(θ|y) ∝ θα ̃−1(1 − θ)β ̃−1 · 1(θ ∈ (0, 1))
which means
θ|y ∼ Beta(α ̃, β ̃)
10/42
Bernouilli Example
Since the posterior is Beta(α ̃, β ̃), the posterior mean and variance when y = 1, are
E[θ|y=1]= α ̃ = α+1 , α ̃ + β ̃ α + β + 1
α ̃ β ̃ ( α + 1 ) · β
V[θ|y =1]= (α ̃+β ̃)2(α ̃+β ̃+1) = (α+β+1)2(α+β+2)
Since (α, β) = (4, 13), we have
E[θ|y =1]=0.29>E[θ]=0.25
V[θ|y = 1] = 0.0104 > V[θ] = 0.01
Since y = 1 observed, we believe θ = Pr(y = 1) should be
larger.
Since the data contradicts our prior, uncertainty around our belief on θ increases. The variance increases.
11/42
Example: Normal Model
WhenX ∼N(μ,σ2),thePDFofX isgivenas 1 1 2
√ 2exp −2σ2(x−μ) 2πσ
Suppose y|θ ∼ N(θ,1). The PDF is
1 1 2
f(y|θ)=√2πexp −2(y−θ) TheprioronθisgivenasN(0,102).
1 1 2 π(θ)=√2π102exp −2·102θ
12/42
Example: Normal Model
Posterior density
π(θ|y)∝exp −2·102θ exp −2(y−θ)
∝π(θ) ∝f(y|θ)
12 22
=exp −2 y −2yθ+θ +θ/100 1 2
∝ exp − 2(100/101) θ − (100/101)y
This implies that the conditional distribution of θ given Y = y is normal with mean (100/101)y and variance 100/101, i.e.,
100y 100 θ|y ∼N 101 ,101
Hence, E[θ|Y = y] = 100y and V[θ|Y = y] = 100. 101 101
121 2
13/42
Conjugate Priors
Previously, we considered an example where we observed Y given the unknown mean θ, and θ had a normal prior distribution. The posterior distribution turned out to be normal as well.
This was an example of a conjugate prior distribution, which has the property that the posterior distribution is of the same family as the prior.
Conjugacy is specific to prior and model combination. For example,
– Normal prior is conjugate for Normal model for mean parameter
– Beta prior is conjugate for Bernouilli model
– But, for instance, Beta prior is not conjugate for Normal model.
Conjugate priors are useful: the posterior is a standard distribution with known properties, e.g, posterior mean and posterior variance are easy to compute. (as above)
14/42
Conjugate Priors: an example of non-conjugate prior
Consider Beta prior for the normal model above. The posterior is
α−1 β−1 1 2 π(θ|y)∝θ (1−θ) ·1(θ∈(0,1))·exp − (y−θ)
2
∝Beta(α,β)
∝N (θ,1)
Let π∗(θ|y) denote the RHS, i.e., π(θ|y) = c · π∗(θ|y) for
some c > 0.
There is no standard distribution proportional to π∗(θ|y).
In order to obtain the posterior, we need to find c such that
11∗
π(θ|y)dθ = c · π (θ|y)dθ = 1 00
15/42
Conjugate Priors: an example of non-conjugate prior
Then, the posterior is
π(θ|y) = π∗(θ|y)
Moreover, the posterior mean and posterior variance are
1 1 π∗(θ|y) E[θ|y]= θ·π(θ|y)dθ= θ 1 ∗ ̃ ̃ dθ
0 0 0 π (θ|y)dθ
1 22
1 π ∗ ( θ ̃ | y ) d θ ̃ 0
π∗(θ|y) V[θ|y]=E (θ−E[θ|y]) |y = (θ−E[θ|y]) 1 ∗ ̃ ̃ dθ
0 0 π (θ|y)dθ
The integrals need to be numerically evaluated. These days, computers
are fast and algorithms are good. We will learn a few algorithms later.
Still prior conjugacy is important: there are still limitations on what can be handled by computational algorithms.
16/42
Prior vs Sample
Consider the model and prior as follows Y|θ ∼ N(θ,τ−1)
θ ∼ N(d,h−1)
where τ and h are precisions (precision = inverse
variance). Assume τ is known.
Normal prior for normal mean is conjugate. Turns out the
posterior is
hτ −1 θ|y∼N h+τ d+ h+τ y,(h+τ)
The posterior mean is a weighted average of the prior information and the sample information
The posterior variance is the sum of prior and sample precisions.
17/42
Prior vs Sample, and Improper prior
Reprint the posterior distribution; hτ −1
θ|Y=y∼N h+τ d+ h+τ y,(h+τ) where h is prior precision.
If you are really sure about the value of θ before observing y. Then, your prior precision is high, i.e., h ↑. The weight on the prior mean is large and the weight on the observation is small. So, E[θ|y] ≈ d = E[θ]
On the other hand, if you are uncertain about θ, h would be small, and the weight on the observation would be relatively large.
18/42
Prior vs Sample, and Improper prior
Reprint the posterior distribution; hτ −1
θ|Y=y∼N h+τ d+ h+τ y,(h+τ) where h is prior precision.
If h → 0, the prior density gets flat, putting no weight on the prior mean.
At the limit, π(θ) ∝ 1 ⇒ often called uniform/flat prior, or even noninformative prior.
The flat prior is not a density because it does not integrate to 1. So, it is called an improper prior. But, the posterior is still a density.
19/42
More than one observation
Let Y1,Y2,…,Yn be a random sample from f(·|θ) and we have π(θ).
When only Y1 = y1 is observed, the posterior is given as π(θ|y1) ∝ π(θ)f (y1|θ)
When Y2 = y2 is also observed, π(θ|y1) is the prior to be updated
π(θ|y1, y2) ∝ π(θ|y1) f (y2|θ) ∝ π(θ)f (y1|θ)f (y2|θ)
new prior
When Y3 = y3 is observed, similarly
π(θ|y1, y2, y3) ∝ π(θ|y1, y2) f (y3|θ) ∝ π(θ)f (y1|θ)f (y2|θ)f (y3|θ)
new prior
So, the posterior is proportional to the prior times the likelihood
n
π(θ|y1, . . . , yn) ∝ π(θ) · f (yi |θ)
i=1 20/42
More than one observation
Suppose we have a random sample from the normal model
iid −1 Y1 = y1,…,Yn = yn|θ ∼ N(θ,τ
)
where τ is known and we have the normal prior θ ∼ N(d,h−1).
Then, applying the conjugacy recursively, we have the posterior
h n τ 1 n θ|y1,.,yn ∼N h+nτ d+ h+nτ n yi ,(h+nτ)−1
i=1
For all (d,h) ∈ R×R+, i.e., for all proper priors, as n → ∞,
– the posterior mean gets close to the sample mean – 1 n yi →p θ0, the true population mean (LLN)
n i=1
– the posterior precision explodes
– So, the posterior will be degenerate at θ0 (Consistency)
21/42
Optimal Decision under Uncertainty
Let L(θ, a) be the loss when you choose an action a ∈ A under θ ∈ Θ.
If you knew the true θ, you would minimise L(θ, a).
But, θ is unknown.
iid
You have prior π(θ) and observe y1,…,yn ∼ f(y|θ).
Under axioms of Savage (1954), it is optimal to choose
aB :=argmin L(θ,a)π(θ|y1,…,yn)dθ a∈A Θ
= argminE[L(θ,a)|y1,…,yn] a∈A
which is called the Bayes action. (Subjective Expected Utility Theory)
22/42
Estimation as a Decision Problem
The econometrician would suffer the error square loss L(θ, a) = (θ − a)2
when she reports a ∈ Θ as her estimate when θ is the true.
For this particular problem, it is optimal to choose
θˆB := argmin (θ−a)2π(θ|y1,…,yn)dθ = E[θ|y1,…,yn] a∈Θ Θ
turns out
If L(θ, a) = |θ − a|, it is optimal to report the posterior median.
As seen here, the optimality depends on the loss function (the way you feel about the error you make)
In the example of the normal model, the Bayesian estimate is the posterior mean under either of the loss functions above.
23/42
Summary of Uncertainty
The posterior variance or the posterior standard deviation can be reported as a summary of uncertainty around the estimate.
Alternatively, a 95% credible interval (CI) can also be reported, which is any interval [a, b] such that
b a
There could be many such intervals (not unique).
The narrowest one is generally preferred and we use here.
Interpretation is natural. (compare with 95% confidence interval!)
For the normal model above, the 95% CI is θˆB ±1.96V(θ|y1,…,yn)
0.95 = Pr(θ ∈ [a,b]|y1,…,yn) =
π(θ|y1,…,yn)dθ
24/42
Linear Regression
Consider a regression model
yi =β1+β2xi2+···+βKxiK +ei
= xi′β + ei
where we assume that ei|xi,β ∼ N(0,1/τ), and β and xi are (K × 1) vectors. Since ei = yi − xi′β, we have
yi −xiβxi,β∼N(0,1/τ),
the likelihood is
nn
τexp−τe2∝τnexp −τ(y−x′β)2
2π2i2 2ii
i=1
PDF of N (0,1/τ )
i=1
=τn exp−τ(y−Xβ)′(y−Xβ) 2
2
25/42
Linear Regression
Priors specification;
τ ∼ Gamma (α0/2, 2/λ0)
β|τ ∼ N b0,(τΣ0)−1
where α0 and λ0 are positive scalars b0 is a K vector, and
Σ0 is a K × K positive definite matrix. The prior is
α0−1λ0 kτ′ π(τ,β)∝τ 2 exp − τ ·1(τ >0)×τ2 exp − (β−b0)Σ0(β−b0)
22
∝π(τ)
∝π(β|τ)
26/42
Linear Regression
The posterior is the prior times the likelihood; α0−1 λ0
π(τ,β|y,X)∝τ2 exp −2τ ·1(τ>0)
·τk exp−τ(β−b )′Σ (β−b )
22000
·τn exp−τ(y −Xβ)′(y −Xβ) 2
To derive the posterior, it is useful to know
(y − Xβ)′(y − Xβ) = (y − XβˆLS)′(y − XβˆLS) + (β − βˆLS)′(X′X)(β − βˆLS) where βˆLS is the OLS estimate, i.e., βˆLS = (X′X)−1X′y
2
27/42
Linear Regression
Some additional algebra shows
τ|y,X ∼ Gamma(αn/2,2/λn)
β|τ,y,X ∼ N bn,(τΣn)−1 Σn :=(X′X+Σ0)
b :=Σ−1(X′Xβˆ +Σ b )=(X′X+Σ )−1(X′y+Σ b ) nnLS00 0 00
and αn := α0 +n and λn := λ0 +y′y +b0′ Σ0b0 −bn′ Σnbn
Note that the posterior mean of β is the weighted average
of the OLS estimate βˆLS and the prior b0.
Moreover, conditional on τ, the posterior precision of β is the sum of the OLS precision and the prior precision.
where
28/42
Linear Regression
If the econometrician has the mean squared error, i.e., L(θ,a)=(θ−a)′(θ−a), fora∈Θ
where θ = (β1,…,β2,τ) is a parameter vector, her optimal estimate is the posterior mean.
Each βj for j = 1,…,K is normally distributed under the posterior,
βj|τ,y,X ∼ N bn,j,V(βj|τ,y,X) where V(βj|τ,y,X) is the (j,j) element of (τΣn)−1.
We often summarise the uncertainty about βj by the posterior standard deviation V(βj|τ,y,X) or the (narrowest) 95% credible interval
bn,j ±1.96V(βj|τ,y,X).
29/42
Bayesian Computation
What if the prior is not conjugate? Then there is no analytic expression for the posterior…
Consider a generic problem where θ is the parameter, π(θ) is the prior, and f(y|θ) is the density of Y given θ. Then, the posterior is
π(θ|y) ∝ π(θ)f(y|θ)
where y = (y1,…,yn) is a sample and f(y|θ) is the density
of y given θ.
We are often interested in the posterior moment;
Θ
where h(θ) is a measurable function of θ, e.g., h(θ) = θ gives the posterior mean, h(θ) = (θ − E [θ|y ])2 the posterior variance, and so on.
E[h(θ)|y] =
h(θ)π(θ|y)dθ
30/42
Bayesian Computation
Suppose we can draw θ(1), θ(2), . . . , θ(S) from π(θ|y).
If the sample has a good property (for example, i.i.d), then
S
sample from the posterior (Monte Carlo).
For the Bayesian computation, especially, a Markov chain Monte Carlo approach is widely used.
We outline the Metropolis-Hastings algorithm.
1S p
h(θ(s)) → E[h(θ)|y]
So, we may evaluate the integrals using a simulated
s=1
31/42
Bayesian Computation
Let q(θ ̃|θ) be a conditional density, called the proposal density.
At the sth iteration of the Metropolis-Hastings: 1. Draw a ‘candidate’
θ ̃ ∼ q ( θ ̃ | θ ( s − 1 ) ) 2. Let θ(s) := θ ̃ with probability
π ( θ ̃ | y ) q ( θ ( s − 1 ) | θ ̃ ) Q := min 1, π(θ(s−1)|y) · q(θ ̃|θ(s−1))
and let θ(s) := θ(s−1) with probability 1 − Q.
The normalising constants are cancelled out and the posterior ratio is computable.
32/42
Bayesian Computation
Note that the sequence θ(1), . . . , θ(S) is a Markov chain, i.e., the distribution of θ(s) depends only on the previous value θ(s−1).
Theoretically, the chain θ(1), . . . , θ(S) has a property good enough to approximate the posterior, i.e., regardless of the initial point θ(0),
S
The Metropolis Hastings algorithm is an example of a Markov chain Monte Carlo (MCMC) method.
1S p
s=1
h(θ(s)) → E[h(θ)|y] as S → ∞.
33/42
Bayesian Computation
In practice, the quality of an MCMC outcome can heavily depend on the initial point θ(0), the proposal density q(·|·), and the model f (·|θ).
For example, if θ(0) is far away from the posterior distribution, it may take a long time for the chain to reach to the support of the posterior.
So, we exclude some early draws that seem to be a ‘searching’ phase rather than draws from the posterior, i.e., we discard the burn-in draws.
If the proposal density is similar to the posterior density, the algorithm works well. But, in practice, it is hard to know the shape of the posterior.
34/42
Bayesian Computation
The normal distribution is often used. In particular, q(θ ̃|θ) is the density of N (θ, Ω). Then,
q(θ ̃|θ) ∝ exp−2 σθ .
Since q(θ ̃|θ) = q(θ|θ ̃), the acceptance rate simplifies to
̃ 2 1 θ−θ
π ( θ ̃ | y ) Q = min 1, π(θ(s−1)|y)
.
The algorithm is called the Gaussian Metropolis-Hastings algorithm.
Often t distribution is used.
35/42
Bayesian Computation
Metropolis-Hastings algorithms require the researcher to tune up the proposal density. For the Gaussian MH, tuning parameter is σθ.
If σθ is too big, Q would be small (as θ ̃ would usually be far from θ(s−1)) and the algorithm may not often update. That is, it can be θ(s) = θ(s+1) = θ(s+2) = ··· for a long time.
If σθ is too small, Q is close to one accepting almost all candidates, but the chain moves very slowly.
θ(s) ≈ θ(s+1) ≈ θ(s+2) ≈ ···
Either case, the chain does not effectively explore the posterior.
36/42
Bayesian Computation
It is important to choose σθ that enables the chain to explore the posterior efficiently, but it is not always easy.
So, a number of adaptive methods have been proposed. For example, Haario, Saksman, and Tamminen (2001).
If θ = (θ1,…,θK) is high dimension, it can be extremely hard to tune the proposal density. For the Gaussian MH, a K dimensional covariance matrix has to be chosen.
For a high-dimensional problem, Gibbs sampler is widely used.
Idea is to recursively update one component or a small dimensional sub-vector of θ.
37/42
Bayesian Computation
Gibbs sampler: at each sth iteration,
1. drawθ(s) ∼π(θ1|θ(s−1),…,θ(s−1),data)
12K
2. drawθ(s) ∼π(θ2|θ(s),θ(s−1),…,θ(s−1),data)
213K
3. drawθ(s) ∼π(θ3|θ(s),θ(s),θ(s−1),…,θ(s−1),data)
3124K 4. .
5. drawθ(s) ∼π(θK|θ(s),…,θ(s),data) K1K
At each-substep, the Gibbs sampler updates each θk using either the (conditional) conjugacy or the MH algorithm (called MH within Gibbs).
Each sub-step may update more than one component, i.e., it could update a block.
38/42
Model Selection
We observe a sample y := (y1,…,yN), but don’t know the true model.
We may assume y ∼ f(y|θ) with unknown θ ∈ Θ; Model 1.
Or, we could assume y ∼ g(y|ψ) with unknown ψ ∈ Ψ;
Model 2.
We have two competing models (theories, assumptions),
M := {1, 2}
Suppose we have conditional priors π(θ|m = 1) and π(ψ|m = 2).
39/42
Model Selection
Conditional on m = 1, recall that the posterior of θ is π(θ|m = 1)f(y|θ) π(θ|m = 1)f(y|θ)
π(θ|y,m = 1) = π(θ ̃|m = 1)f(y|θ ̃)dθ ̃ =
h(y|m = 1)
where h(y|m = 1) is the marginal likelihood of model
m = 1.
Similarly, conditional on m = 2,
π(ψ|m = 2)g(y|ψ) π(θ|m = 2)g(y|ψ)
π(ψ|y, m = 2) = π(ψ ̃|m = 2)g(y|ψ ̃)dψ ̃ =
h(y|m = 2)
where h(y|m = 2) is the marginal likelihood of model
m = 1.
Then, the (marginal) likelihood ratio of model 1 relative to
model 2 is
B1,2 := h(y|m = 1) h(y|m = 2)
which is called the Bayes factor.
40/42
Model Selection
Suppose we believe model m ∈ M is true with prior probability π(m). Then, the posterior probability of model m = 1 is given as
π(m = 1|y) = π(m = 1)h(y|m = 1)
π(m = 1)h(y|m = 1) + π(m = 2)h(y|m = 2)
The posterior odd ratio of model 1 to model 2 is π(m=1|y) = π(m=1)h(y|m=1)
π(m = 2|y) π(m = 2)h(y|m = 2)
That is, the posterior odd ratio = prior odd ratio × Bayes
factor
If the posterior odd ratio > 1, we believe model 1 is more reasonable.
BIC and AIC are rough approximations of this formal Bayesian model selection.
41/42
Bayesian Asymptotics
Under mild conditions, posterior is consistent, i.e., the posterior asymptotically degenerates at the true parameter; Doob (1941), Schwartz (1965)
The posterior and the sampling distribution of MLE are asymptotically equivalent under some regularity conditions on the model and the prior. (Bernstein – von Mises theorem)
For large sample, therefore, the Bayesian analysis is robust to the choice of prior. Moreover, since MLE is efficient, Bayes is also efficient.
Asymptotic robustness can be used as a reaction to the criticism that Bayesian analysis is essentially subjective (depends on the prior).
42/42