
ECON6300/7320/8300 Advanced Microeconometrics Bayesian Methods
Christiern Rose, University of Queensland
Lecture 9

A Thought Experiment
- Experiment: select a person at random on the St. Lucia campus.
- Event A := the selected person is taller than 190cm. Then, Pr(A) = ?
- Event B := the selected person is female.
- A female is selected. What is now the probability that she is taller than 190cm? Pr(A|B) = ?

Subjective Probability
- Each one of us can say what we believe about Pr(A), although it might be wrong.
- When we learn that a female was selected, we may revise our belief about the person being taller than 190cm. Mathematically, we write it Pr(A|B).
- Your belief before the additional information, Pr(A), is called the prior.
- Your (revised) belief after the information, Pr(A|B), is the posterior.

Bayes Theorem
- Recall that
    Pr(A|B) = Pr(A ∩ B) / Pr(B)  and similarly  Pr(B|A) = Pr(A ∩ B) / Pr(A)
- So, Pr(A ∩ B) = Pr(A) Pr(B|A), and hence
    Pr(A|B) = Pr(A) Pr(B|A) / Pr(B)
- This is a version of Bayes theorem.
- Bayes theorem shows how the prior is updated to the posterior when there is additional information.
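The update rule is easy to check numerically. A minimal sketch for the campus example; all three input probabilities below are made-up assumptions, not figures from the lecture:

```python
# Hypothetical inputs for the campus example (illustrative numbers only)
pr_A = 0.03           # prior: Pr(person is taller than 190cm)
pr_B = 0.50           # Pr(selected person is female)
pr_B_given_A = 0.10   # Pr(female | taller than 190cm)

# Bayes theorem: Pr(A|B) = Pr(A) Pr(B|A) / Pr(B)
pr_A_given_B = pr_A * pr_B_given_A / pr_B
print(pr_A_given_B)   # ≈ 0.006: learning B revises the prior 0.03 downward
```

Under these assumed numbers, learning that a female was selected cuts the probability of "taller than 190cm" by a factor of five.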

Bayes Theorem for PDFs
- Recall Bayes theorem:
    Pr(A|B) = Pr(A) Pr(B|A) / Pr(B)
- Let π(x) be the marginal PDF of X.
- Let p(y) be the marginal PDF of Y.
- Let f(y|x) be the conditional PDF of Y given X = x.
- By Bayes theorem, we write
    π(x|y) = π(x) f(y|x) / p(y)
- π(x) is the prior on X and π(x|y) is the posterior on X given Y = y.

Bayes Theorem for PDFs
- If we use θ (the parameter) instead of x:
    π(θ|y) = π(θ) f(y|θ) / p(y)
- π(θ) is the prior on θ and π(θ|y) is the posterior on θ given Y = y.
- The parameter θ may not be random, but it is uncertain.
- Our subjective belief about θ is represented as π(θ) (prior belief, before data) and π(θ|y) (posterior belief, after data).

Bernoulli Example
- A coin is tossed and Y = 1 if the outcome is heads and Y = 0 otherwise.
- Let θ := Pr(Y = 1), i.e., Y ∼ Bernoulli(θ). Then,
    f(y|θ) = θ^y (1 − θ)^{1−y} · 1(y ∈ {0, 1})
- Suppose you believe every θ ∈ (0, 1) is equally likely. Then, the prior is
    θ ∼ Uniform(0, 1), i.e., π(θ) = 1(θ ∈ (0, 1))
- Then, the posterior is
    π(θ|y) = 1(θ ∈ (0, 1)) · θ^y (1 − θ)^{1−y} / p(y)
  where the first factor is π(θ) and the second is f(y|θ).
- If y = 1,
    π(θ|y = 1) = 1(θ ∈ (0, 1)) · θ / p(1) = 2θ · 1(θ ∈ (0, 1))
- What is π(θ|y = 0)?

Bernoulli Example
- To be more flexible, we consider the beta prior, i.e.,
    π(θ) = [Γ(α + β) / (Γ(α)Γ(β))] · θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1))
- Since θ ∼ Beta(α, β), we know that
    E[θ] = α / (α + β)
    V[θ] = αβ / [(α + β)^2 (α + β + 1)]
- Suppose you believe that θ = 0.25 on average, i.e., E[θ] = 1/4, and your belief is quite strong, e.g., V[θ] = 0.01. Then, (α, β) ≈ (4, 13).
- As before, let's compute π(θ|y = 1).

Bernoulli Example
- To solve this problem, we need to learn a useful trick:
    π(θ) = [Γ(α + β) / (Γ(α)Γ(β))] · θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1))
         = c · θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1))
  where c is the normalising constant (i.e., it does not depend on θ).
- Once we know
    π(θ) = c · θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1)),
  we also know c = Γ(α + β) / (Γ(α)Γ(β)) (why? a density must integrate to 1).
- So, even if we simply write
    π(θ) ∝ θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1)),
  we immediately know
    θ ∼ Beta(α, β).

Bernoulli Example
- Bayes theorem:
    π(θ|y) = π(θ) f(y|θ) / p(y) ∝ π(θ) f(y|θ)
           ∝ θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1)) · θ^y (1 − θ)^{1−y}
           = θ^{(α+y)−1} (1 − θ)^{(β+1−y)−1} · 1(θ ∈ (0, 1))
           = θ^{α̃−1} (1 − θ)^{β̃−1} · 1(θ ∈ (0, 1))
  where α̃ = α + y and β̃ = β + 1 − y.
- The posterior density is
    π(θ|y) ∝ θ^{α̃−1} (1 − θ)^{β̃−1} · 1(θ ∈ (0, 1)),
  which means
    θ|y ∼ Beta(α̃, β̃)

Bernoulli Example
- Since the posterior is Beta(α̃, β̃), the posterior mean and variance when y = 1 are
    E[θ|y = 1] = α̃ / (α̃ + β̃) = (α + 1) / (α + β + 1)
    V[θ|y = 1] = α̃β̃ / [(α̃ + β̃)^2 (α̃ + β̃ + 1)] = (α + 1)β / [(α + β + 1)^2 (α + β + 2)]
- Since (α, β) ≈ (4, 13), we have
    E[θ|y = 1] = 0.29 > E[θ] = 0.25
    V[θ|y = 1] = 0.0104 > V[θ] = 0.01
- Since y = 1 is observed, we believe θ = Pr(y = 1) should be larger.
- Since the data contradicts our prior, uncertainty around our belief about θ increases: the variance increases.
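These numbers can be reproduced directly. The sketch below moment-matches (α, β) to the stated prior mean and variance (giving α ≈ 4.44, β ≈ 13.31, i.e. roughly the (4, 13) above) and then applies the conjugate Beta update:

```python
m, v = 0.25, 0.01                    # target prior mean and variance
s = m * (1 - m) / v - 1              # α + β implied by the Beta moments
alpha, beta = m * s, (1 - m) * s     # ≈ (4.44, 13.31)

y = 1                                      # observe a head
a_post, b_post = alpha + y, beta + 1 - y   # conjugate update: Beta(α+y, β+1−y)
post_mean = a_post / (a_post + b_post)
post_var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
print(round(post_mean, 2), round(post_var, 4))   # 0.29 0.0104
```

Both the mean and the variance rise after observing y = 1, exactly as the slide states.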

Example: Normal Model
- When X ∼ N(μ, σ^2), the PDF of X is given as
    (1/√(2πσ^2)) exp( −(x − μ)^2 / (2σ^2) )
- Suppose y|θ ∼ N(θ, 1). The PDF is
    f(y|θ) = (1/√(2π)) exp( −(y − θ)^2 / 2 )
- The prior on θ is given as N(0, 10^2):
    π(θ) = (1/√(2π · 10^2)) exp( −θ^2 / (2 · 10^2) )

Example: Normal Model
- Posterior density: prior times likelihood,
    π(θ|y) ∝ exp( −θ^2 / (2 · 10^2) ) · exp( −(y − θ)^2 / 2 )
           = exp( −(1/2)[ y^2 − 2yθ + θ^2 + θ^2/100 ] )
           ∝ exp( −(θ − (100/101)y)^2 / (2 · (100/101)) )
- This implies that the conditional distribution of θ given Y = y is normal with mean (100/101)y and variance 100/101, i.e.,
    θ|y ∼ N( (100/101)y, 100/101 )
- Hence, E[θ|Y = y] = (100/101)y and V[θ|Y = y] = 100/101.
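The 100/101 factors come from adding precisions (inverse variances). A quick numerical check of the slide's formulas, with an arbitrary observation y assumed for illustration:

```python
prior_var, data_var = 100.0, 1.0   # prior N(0, 10^2), likelihood y|θ ~ N(θ, 1)
y = 2.0                            # any observation; this value is illustrative

post_prec = 1 / prior_var + 1 / data_var              # precisions add: 101/100
post_var = 1 / post_prec                              # 100/101
post_mean = post_var * (0 / prior_var + y / data_var) # (100/101) * y
print(post_mean, post_var)
```

The same precision-weighting pattern reappears below in the "Prior vs Sample" slides.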

Conjugate Priors
- Previously, we considered an example where we observed Y given the unknown mean θ, and θ had a normal prior distribution. The posterior distribution turned out to be normal as well.
- This was an example of a conjugate prior distribution, which has the property that the posterior distribution is of the same family as the prior.
- Conjugacy is specific to the prior and model combination. For example,
  – the normal prior is conjugate for the normal model's mean parameter;
  – the beta prior is conjugate for the Bernoulli model;
  – but, for instance, the beta prior is not conjugate for the normal model.
- Conjugate priors are useful: the posterior is a standard distribution with known properties, e.g., the posterior mean and posterior variance are easy to compute (as above).

Conjugate Priors: an example of a non-conjugate prior
- Consider a beta prior for the normal model above. The posterior is
    π(θ|y) ∝ θ^{α−1} (1 − θ)^{β−1} · 1(θ ∈ (0, 1)) · exp( −(y − θ)^2 / 2 )
  where the first factor is proportional to the Beta(α, β) density and the second to the N(θ, 1) density.
- Let π*(θ|y) denote the RHS, i.e., π(θ|y) = c · π*(θ|y) for some c > 0.
- There is no standard distribution proportional to π*(θ|y).
- In order to obtain the posterior, we need to find c such that
    ∫₀¹ π(θ|y) dθ = c · ∫₀¹ π*(θ|y) dθ = 1

Conjugate Priors: an example of a non-conjugate prior
- Then, the posterior is
    π(θ|y) = π*(θ|y) / ∫₀¹ π*(θ̃|y) dθ̃
- Moreover, the posterior mean and posterior variance are
    E[θ|y] = ∫₀¹ θ · π(θ|y) dθ = ∫₀¹ θ · [ π*(θ|y) / ∫₀¹ π*(θ̃|y) dθ̃ ] dθ
    V[θ|y] = E[ (θ − E[θ|y])^2 | y ] = ∫₀¹ (θ − E[θ|y])^2 · [ π*(θ|y) / ∫₀¹ π*(θ̃|y) dθ̃ ] dθ
- The integrals need to be evaluated numerically. These days, computers are fast and algorithms are good. We will learn a few algorithms later.
- Still, prior conjugacy is important: there remain limitations on what can be handled by computational algorithms.
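For a scalar θ on (0, 1), these integrals are easy to evaluate on a grid. A sketch with a Beta(2, 2) prior and one normal observation; α, β and y below are assumed values chosen for illustration:

```python
import numpy as np

# Normalising the non-conjugate posterior π*(θ|y) by grid quadrature over (0, 1)
alpha, beta, y = 2.0, 2.0, 0.8   # illustrative prior parameters and observation

theta = np.linspace(1e-6, 1 - 1e-6, 10_001)
dtheta = theta[1] - theta[0]
post_unnorm = theta**(alpha - 1) * (1 - theta)**(beta - 1) * np.exp(-0.5 * (y - theta)**2)

c = post_unnorm.sum() * dtheta    # ≈ ∫ π*(θ|y) dθ, the normalising constant
post = post_unnorm / c            # proper density on the grid
post_mean = (theta * post).sum() * dtheta
post_var = ((theta - post_mean)**2 * post).sum() * dtheta
print(post_mean, post_var)
```

In one dimension this brute-force quadrature works fine; it is in higher dimensions that the MCMC methods introduced later become necessary.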

Prior vs Sample
- Consider the model and prior
    Y|θ ∼ N(θ, τ^{−1})
    θ ∼ N(d, h^{−1})
  where τ and h are precisions (precision = inverse variance). Assume τ is known.
- The normal prior for a normal mean is conjugate. It turns out the posterior is
    θ|y ∼ N( (h/(h+τ)) d + (τ/(h+τ)) y, (h + τ)^{−1} )
- The posterior mean is a weighted average of the prior information and the sample information.
- The posterior precision is the sum of the prior and sample precisions.
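The weighting is easy to see numerically. A sketch with assumed values d = 5, τ = 1, y = 0, comparing a tight prior with a diffuse one:

```python
d, tau, y = 5.0, 1.0, 0.0   # prior mean, data precision, observation (all illustrative)

def posterior(h):
    """Posterior mean and variance of θ|y for prior precision h."""
    mean = (h / (h + tau)) * d + (tau / (h + tau)) * y
    return mean, 1 / (h + tau)

print(posterior(100.0))   # strong prior: mean stays near d = 5
print(posterior(0.01))    # weak prior: mean is pulled almost all the way to y = 0
```

As h → 0 the posterior approaches N(y, 1/τ), which previews the flat-prior limit on the next slide.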

Prior vs Sample, and Improper Prior
- Recall the posterior distribution
    θ|Y = y ∼ N( (h/(h+τ)) d + (τ/(h+τ)) y, (h + τ)^{−1} )
  where h is the prior precision.
- If you are really sure about the value of θ before observing y, then your prior precision is high, i.e., h ↑. The weight on the prior mean is large and the weight on the observation is small, so E[θ|y] ≈ d = E[θ].
- On the other hand, if you are uncertain about θ, h would be small, and the weight on the observation would be relatively large.

Prior vs Sample, and Improper Prior
- Recall the posterior distribution
    θ|Y = y ∼ N( (h/(h+τ)) d + (τ/(h+τ)) y, (h + τ)^{−1} )
  where h is the prior precision.
- If h → 0, the prior density gets flat, putting no weight on the prior mean.
- In the limit, π(θ) ∝ 1, which is often called a uniform/flat prior, or even a noninformative prior.
- The flat prior is not a density because it does not integrate to 1, so it is called an improper prior. But the posterior is still a density.

More than one observation
- Let Y1, Y2, …, Yn be a random sample from f(·|θ), and we have the prior π(θ).
- When only Y1 = y1 is observed, the posterior is given as
    π(θ|y1) ∝ π(θ) f(y1|θ)
- When Y2 = y2 is also observed, π(θ|y1) is the prior to be updated:
    π(θ|y1, y2) ∝ π(θ|y1) f(y2|θ) ∝ π(θ) f(y1|θ) f(y2|θ)
  where π(θ|y1) plays the role of the new prior.
- When Y3 = y3 is observed, similarly,
    π(θ|y1, y2, y3) ∝ π(θ|y1, y2) f(y3|θ) ∝ π(θ) f(y1|θ) f(y2|θ) f(y3|θ)
  with π(θ|y1, y2) as the new prior.
- So, the posterior is proportional to the prior times the likelihood:
    π(θ|y1, …, yn) ∝ π(θ) · ∏_{i=1}^n f(yi|θ)

More than one observation
- Suppose we have a random sample from the normal model
    Y1 = y1, …, Yn = yn | θ ∼ iid N(θ, τ^{−1})
  where τ is known, and we have the normal prior θ ∼ N(d, h^{−1}).
- Then, applying the conjugacy recursively, we have the posterior
    θ|y1, …, yn ∼ N( (h/(h+nτ)) d + (nτ/(h+nτ)) · (1/n)∑_{i=1}^n yi, (h + nτ)^{−1} )
- For all (d, h) ∈ R × R₊, i.e., for all proper priors, as n → ∞,
  – the posterior mean gets close to the sample mean;
  – (1/n)∑_{i=1}^n yi →p θ0, the true population mean (LLN);
  – the posterior precision explodes;
  – so the posterior will be degenerate at θ0 (consistency).
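The consistency claim can be illustrated by simulation: as n grows, the posterior mean tracks the sample mean (and hence θ0) and the posterior variance shrinks. The true value θ0 = 2 and the hyperparameters below are assumptions made for the sketch:

```python
import random

random.seed(0)
d, h = 0.0, 1.0            # prior N(0, 1)
tau, theta0 = 1.0, 2.0     # known data precision; θ0 is used only to simulate data

for n in (1, 10, 1000):
    ys = [random.gauss(theta0, 1 / tau**0.5) for _ in range(n)]
    ybar = sum(ys) / n
    # Conjugate posterior: weight nτ/(h+nτ) on the sample mean grows with n
    post_mean = (h / (h + n * tau)) * d + (n * tau / (h + n * tau)) * ybar
    post_var = 1 / (h + n * tau)
    print(n, round(post_mean, 3), round(post_var, 5))
```

By n = 1000 the prior mean d = 0 is essentially irrelevant and the posterior concentrates near θ0.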

Optimal Decision under Uncertainty
- Let L(θ, a) be the loss when you choose an action a ∈ A under θ ∈ Θ.
- If you knew the true θ, you would minimise L(θ, a).
- But θ is unknown.
- You have the prior π(θ) and observe y1, …, yn ∼ iid f(y|θ).
- Under the axioms of Savage (1954), it is optimal to choose
    a_B := argmin_{a∈A} ∫_Θ L(θ, a) π(θ|y1, …, yn) dθ
         = argmin_{a∈A} E[ L(θ, a) | y1, …, yn ]
  which is called the Bayes action (subjective expected utility theory).

Estimation as a Decision Problem
- The econometrician suffers the squared error loss
    L(θ, a) = (θ − a)^2
  when she reports a ∈ Θ as her estimate and θ is the true value.
- For this particular problem, it turns out to be optimal to choose
    θ̂_B := argmin_{a∈Θ} ∫_Θ (θ − a)^2 π(θ|y1, …, yn) dθ = E[θ|y1, …, yn]
- If L(θ, a) = |θ − a|, it is optimal to report the posterior median.
- As seen here, the optimal choice depends on the loss function (the way you feel about the error you make).
- In the example of the normal model, the Bayesian estimate is the posterior mean under either of the loss functions above (the normal posterior is symmetric, so its mean and median coincide).

Summary of Uncertainty
- The posterior variance or the posterior standard deviation can be reported as a summary of uncertainty around the estimate.
- Alternatively, a 95% credible interval (CI) can be reported, which is any interval [a, b] such that
    0.95 = Pr(θ ∈ [a, b] | y1, …, yn) = ∫_a^b π(θ|y1, …, yn) dθ
- There can be many such intervals (the CI is not unique).
- The narrowest one is generally preferred, and we use it here.
- The interpretation is natural (compare with a 95% confidence interval!).
- For the normal model above, the 95% CI is
    θ̂_B ± 1.96 √V(θ|y1, …, yn)
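For a normal posterior, the narrowest 95% CI is the symmetric one around the mean. A small sketch with an assumed posterior mean and variance:

```python
post_mean, post_var = 2.4, 0.2    # illustrative posterior moments
half_width = 1.96 * post_var**0.5
ci = (post_mean - half_width, post_mean + half_width)
print(ci)   # roughly (1.52, 3.28)
```

Unlike a confidence interval, this interval is read directly as "θ lies in [a, b] with posterior probability 0.95".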

Linear Regression
- Consider a regression model
    yi = β1 + β2 xi2 + ··· + βK xiK + ei = xi′β + ei
  where we assume that ei|xi, β ∼ N(0, 1/τ), and β and xi are (K × 1) vectors.
- Since ei = yi − xi′β, we have
    yi − xi′β | xi, β ∼ N(0, 1/τ),
  so the likelihood is
    ∏_{i=1}^n √(τ/2π) exp( −(τ/2) ei^2 ) ∝ τ^{n/2} exp( −(τ/2) ∑_{i=1}^n (yi − xi′β)^2 )
                                         = τ^{n/2} exp( −(τ/2) (y − Xβ)′(y − Xβ) )

Linear Regression
- Prior specification:
    τ ∼ Gamma(α0/2, 2/λ0)
    β|τ ∼ N( b0, (τΣ0)^{−1} )
  where α0 and λ0 are positive scalars, b0 is a (K × 1) vector, and Σ0 is a K × K positive definite matrix.
- The prior is
    π(τ, β) ∝ τ^{α0/2 − 1} exp( −(λ0/2) τ ) · 1(τ > 0) × τ^{K/2} exp( −(τ/2) (β − b0)′Σ0(β − b0) )
  where the first factor is proportional to π(τ) and the second to π(β|τ).

Linear Regression
- The posterior is the prior times the likelihood:
    π(τ, β|y, X) ∝ τ^{α0/2 − 1} exp( −(λ0/2) τ ) · 1(τ > 0)
                   · τ^{K/2} exp( −(τ/2) (β − b0)′Σ0(β − b0) )
                   · τ^{n/2} exp( −(τ/2) (y − Xβ)′(y − Xβ) )
- To derive the posterior, it is useful to know
    (y − Xβ)′(y − Xβ) = (y − Xβ̂_LS)′(y − Xβ̂_LS) + (β − β̂_LS)′(X′X)(β − β̂_LS)
  where β̂_LS is the OLS estimate, i.e., β̂_LS = (X′X)^{−1}X′y.

Linear Regression
- Some additional algebra shows
    τ|y, X ∼ Gamma(αn/2, 2/λn)
    β|τ, y, X ∼ N( bn, (τΣn)^{−1} )
  where
    Σn := X′X + Σ0
    bn := Σn^{−1}(X′X β̂_LS + Σ0 b0) = (X′X + Σ0)^{−1}(X′y + Σ0 b0)
    αn := α0 + n  and  λn := λ0 + y′y + b0′Σ0b0 − bn′Σnbn
- Note that the posterior mean of β is a weighted average of the OLS estimate β̂_LS and the prior mean b0.
- Moreover, conditional on τ, the posterior precision of β is the sum of the OLS precision and the prior precision.
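These formulas can be checked on simulated data. Everything below (sample size, true coefficients, hyperparameters) is assumed for illustration; note that a Gamma(αn/2, 2/λn) distribution has mean αn/λn:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true, tau_true = np.array([1.0, 2.0, -1.0]), 1.0   # made-up "truth"
y = X @ beta_true + rng.normal(scale=1 / np.sqrt(tau_true), size=n)

# Illustrative prior hyperparameters
alpha0, lam0 = 2.0, 2.0
b0, Sigma0 = np.zeros(K), np.eye(K)

# Posterior: Σn = X'X + Σ0, bn = Σn^{-1}(X'y + Σ0 b0),
#            αn = α0 + n, λn = λ0 + y'y + b0'Σ0b0 − bn'Σnbn
Sigma_n = X.T @ X + Sigma0
b_n = np.linalg.solve(Sigma_n, X.T @ y + Sigma0 @ b0)
alpha_n = alpha0 + n
lam_n = lam0 + y @ y + b0 @ Sigma0 @ b0 - b_n @ Sigma_n @ b_n

print(b_n)              # posterior mean of β, close to the OLS estimate here
print(alpha_n / lam_n)  # posterior mean of τ
```

With n = 200 and a mild prior (Σ0 = I), the shrinkage toward b0 = 0 is small, so bn sits near the OLS estimate.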

Linear Regression
- If the econometrician has the mean squared error loss, i.e.,
    L(θ, a) = (θ − a)′(θ − a)  for a ∈ Θ,
  where θ = (β1, …, βK, τ) is the parameter vector, her optimal estimate is the posterior mean.
- Each βj for j = 1, …, K is normally distributed under the posterior,
    βj|τ, y, X ∼ N( bn,j, V(βj|τ, y, X) )
  where V(βj|τ, y, X) is the (j, j) element of (τΣn)^{−1}.
- We often summarise the uncertainty about βj by the posterior standard deviation √V(βj|τ, y, X) or the (narrowest) 95% credible interval
    bn,j ± 1.96 √V(βj|τ, y, X)

Bayesian Computation
- What if the prior is not conjugate? Then there is no analytic expression for the posterior.
- Consider a generic problem where θ is the parameter, π(θ) is the prior, and f(y|θ) is the density of the sample y = (y1, …, yn) given θ. Then, the posterior is
    π(θ|y) ∝ π(θ) f(y|θ)
- We are often interested in posterior moments:
    E[h(θ)|y] = ∫_Θ h(θ) π(θ|y) dθ
  where h(θ) is a measurable function of θ, e.g., h(θ) = θ gives the posterior mean, h(θ) = (θ − E[θ|y])^2 the posterior variance, and so on.

Bayesian Computation
- Suppose we can draw θ^(1), θ^(2), …, θ^(S) from π(θ|y).
- If the sample has a good property (for example, i.i.d.), then
    (1/S) ∑_{s=1}^S h(θ^(s)) →p E[h(θ)|y]
- So, we may evaluate the integrals using a simulated sample from the posterior (Monte Carlo).
- For Bayesian computation especially, Markov chain Monte Carlo approaches are widely used.
- We outline the Metropolis-Hastings algorithm.

Bayesian Computation
- Let q(θ̃|θ) be a conditional density, called the proposal density.
- At the sth iteration of the Metropolis-Hastings algorithm:
  1. Draw a 'candidate'
       θ̃ ∼ q(θ̃|θ^(s−1))
  2. Let θ^(s) := θ̃ with probability
       Q := min{ 1, [π(θ̃|y) q(θ^(s−1)|θ̃)] / [π(θ^(s−1)|y) q(θ̃|θ^(s−1))] }
     and let θ^(s) := θ^(s−1) with probability 1 − Q.
- The normalising constants cancel out, so the posterior ratio is computable.
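The two steps above can be sketched in a few lines. A minimal random-walk Metropolis-Hastings implementation targeting the non-conjugate beta-prior/normal-likelihood posterior from earlier; the values of α, β, y, the proposal scale, and the chain length are all assumptions:

```python
import math
import random

random.seed(1)
alpha, beta, y = 2.0, 2.0, 0.8     # illustrative prior parameters and observation

def log_post(theta):
    """Unnormalised log posterior: log π(θ) + log f(y|θ); constants drop out."""
    if not 0 < theta < 1:
        return -math.inf
    return ((alpha - 1) * math.log(theta) + (beta - 1) * math.log(1 - theta)
            - 0.5 * (y - theta) ** 2)

sigma, theta = 0.3, 0.5            # proposal scale and initial point
draws = []
for s in range(20_000):
    cand = random.gauss(theta, sigma)                  # symmetric proposal: q cancels
    if math.log(random.random()) < log_post(cand) - log_post(theta):
        theta = cand                                   # accept with probability Q
    draws.append(theta)                                # else keep θ^(s−1)

kept = draws[2_000:]                                   # discard burn-in
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2))
```

Working on the log scale avoids numerical underflow; comparing log u with the log acceptance ratio is equivalent to the min{1, ·} rule.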

Bayesian Computation
- Note that the sequence θ^(1), …, θ^(S) is a Markov chain, i.e., the distribution of θ^(s) depends only on the previous value θ^(s−1).
- Theoretically, the chain θ^(1), …, θ^(S) has properties good enough to approximate the posterior, i.e., regardless of the initial point θ^(0),
    (1/S) ∑_{s=1}^S h(θ^(s)) →p E[h(θ)|y]  as S → ∞.
- The Metropolis-Hastings algorithm is an example of a Markov chain Monte Carlo (MCMC) method.

Bayesian Computation
- In practice, the quality of an MCMC outcome can heavily depend on the initial point θ^(0), the proposal density q(·|·), and the model f(·|θ).
- For example, if θ^(0) is far away from the posterior distribution, it may take a long time for the chain to reach the support of the posterior.
- So, we exclude some early draws that seem to be a 'searching' phase rather than draws from the posterior, i.e., we discard the burn-in draws.
- If the proposal density is similar to the posterior density, the algorithm works well. But, in practice, it is hard to know the shape of the posterior.

Bayesian Computation
- The normal distribution is often used as the proposal. In particular, let q(θ̃|θ) be the density of N(θ, Ω); for scalar θ, with Ω = σθ^2,
    q(θ̃|θ) ∝ exp( −(1/2) ((θ̃ − θ)/σθ)^2 ).
- Since q(θ̃|θ) = q(θ|θ̃), the acceptance probability simplifies to
    Q = min{ 1, π(θ̃|y) / π(θ^(s−1)|y) }.
- This algorithm is called the Gaussian Metropolis-Hastings algorithm.
- Often a t distribution is used instead.

Bayesian Computation
- Metropolis-Hastings algorithms require the researcher to tune the proposal density. For the Gaussian MH, the tuning parameter is σθ.
- If σθ is too big, Q would be small (as θ̃ would usually be far from θ^(s−1)) and the algorithm may rarely update. That is, we can have θ^(s) = θ^(s+1) = θ^(s+2) = ··· for a long time.
- If σθ is too small, Q is close to one and almost all candidates are accepted, but the chain moves very slowly:
    θ^(s) ≈ θ^(s+1) ≈ θ^(s+2) ≈ ···
- In either case, the chain does not explore the posterior effectively.

Bayesian Computation
- It is important to choose a σθ that enables the chain to explore the posterior efficiently, but this is not always easy.
- So, a number of adaptive methods have been proposed, for example, Haario, Saksman, and Tamminen (2001).
- If θ = (θ1, …, θK) is high-dimensional, it can be extremely hard to tune the proposal density. For the Gaussian MH, a K-dimensional covariance matrix has to be chosen.
- For a high-dimensional problem, the Gibbs sampler is widely used.
- The idea is to recursively update one component, or a small-dimensional sub-vector, of θ.

Bayesian Computation
- Gibbs sampler: at the sth iteration,
  1. draw θ1^(s) ∼ π(θ1 | θ2^(s−1), …, θK^(s−1), data)
  2. draw θ2^(s) ∼ π(θ2 | θ1^(s), θ3^(s−1), …, θK^(s−1), data)
  3. draw θ3^(s) ∼ π(θ3 | θ1^(s), θ2^(s), θ4^(s−1), …, θK^(s−1), data)
  ⋮
  K. draw θK^(s) ∼ π(θK | θ1^(s), …, θ_{K−1}^(s), data)
- At each sub-step, the Gibbs sampler updates each θk using either the (conditional) conjugacy or an MH step (called MH within Gibbs).
- Each sub-step may update more than one component, i.e., it could update a block.
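A minimal Gibbs sketch for a case where both full conditionals are known in closed form: a bivariate normal target with unit variances and correlation ρ. The target and ρ = 0.8 are assumptions standing in for a real posterior:

```python
import random

# Target: (θ1, θ2) bivariate normal, zero means, unit variances, correlation ρ.
# Full conditionals: θ1|θ2 ~ N(ρθ2, 1−ρ²) and θ2|θ1 ~ N(ρθ1, 1−ρ²).
random.seed(2)
rho = 0.8
sd = (1 - rho**2) ** 0.5
t1, t2, draws = 0.0, 0.0, []
for s in range(20_000):
    t1 = random.gauss(rho * t2, sd)   # sub-step 1: update θ1 given current θ2
    t2 = random.gauss(rho * t1, sd)   # sub-step 2: update θ2 given the new θ1
    draws.append((t1, t2))

kept = draws[1_000:]                  # discard burn-in
mean1 = sum(d[0] for d in kept) / len(kept)
cov12 = sum(d[0] * d[1] for d in kept) / len(kept)   # ≈ ρ, since means ≈ 0
print(round(mean1, 2), round(cov12, 2))
```

No tuning parameter is needed here, which is exactly the appeal of Gibbs over plain MH when the full conditionals are tractable; the closer ρ is to 1, though, the more slowly the chain mixes.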

Model Selection
- We observe a sample y := (y1, …, yN) but don't know the true model.
- We may assume y ∼ f(y|θ) with unknown θ ∈ Θ: Model 1.
- Or, we could assume y ∼ g(y|ψ) with unknown ψ ∈ Ψ: Model 2.
- We have two competing models (theories, assumptions), M := {1, 2}.
- Suppose we have conditional priors π(θ|m = 1) and π(ψ|m = 2).

Model Selection
- Conditional on m = 1, recall that the posterior of θ is
    π(θ|y, m = 1) = π(θ|m = 1)f(y|θ) / ∫ π(θ̃|m = 1)f(y|θ̃) dθ̃ = π(θ|m = 1)f(y|θ) / h(y|m = 1)
  where h(y|m = 1) is the marginal likelihood of model m = 1.
- Similarly, conditional on m = 2,
    π(ψ|y, m = 2) = π(ψ|m = 2)g(y|ψ) / ∫ π(ψ̃|m = 2)g(y|ψ̃) dψ̃ = π(ψ|m = 2)g(y|ψ) / h(y|m = 2)
  where h(y|m = 2) is the marginal likelihood of model m = 2.
- Then, the (marginal) likelihood ratio of model 1 relative to model 2 is
    B_{1,2} := h(y|m = 1) / h(y|m = 2),
  which is called the Bayes factor.

Model Selection
- Suppose we believe model m ∈ M is true with prior probability π(m). Then, the posterior probability of model m = 1 is given as
    π(m = 1|y) = π(m = 1)h(y|m = 1) / [ π(m = 1)h(y|m = 1) + π(m = 2)h(y|m = 2) ]
- The posterior odds ratio of model 1 to model 2 is
    π(m = 1|y) / π(m = 2|y) = [π(m = 1)/π(m = 2)] · [h(y|m = 1)/h(y|m = 2)]
- That is, posterior odds ratio = prior odds ratio × Bayes factor.
- If the posterior odds ratio > 1, we believe model 1 is more reasonable.
- BIC and AIC are rough approximations to this formal Bayesian model selection.
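When both marginal likelihoods are available in closed form, the Bayes factor is a one-liner. A sketch comparing a Beta(1, 1) prior on a Bernoulli θ against a model that fixes θ = 1/2; the data and both models are made up for illustration:

```python
import math

def log_beta(a, b):
    """log of the Beta function B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

y = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # made-up coin flips: 8 heads in 10 tosses
n, k = len(y), sum(y)

# Marginal likelihoods: h(y|m=1) = B(1+k, 1+n−k)/B(1, 1); h(y|m=2) = 0.5^n
log_h1 = log_beta(1 + k, 1 + n - k) - log_beta(1, 1)
log_h2 = n * math.log(0.5)
bayes_factor = math.exp(log_h1 - log_h2)
print(round(bayes_factor, 2))   # 2.07: the data mildly favour the flexible model
```

With equal prior model probabilities, the posterior odds equal the Bayes factor, so model 1 is preferred here, though only mildly.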

Bayesian Asymptotics
- Under mild conditions, the posterior is consistent, i.e., the posterior asymptotically degenerates at the true parameter; Doob (1941), Schwartz (1965).
- The posterior and the sampling distribution of the MLE are asymptotically equivalent under some regularity conditions on the model and the prior (the Bernstein-von Mises theorem).
- For large samples, therefore, the Bayesian analysis is robust to the choice of prior. Moreover, since the MLE is efficient, Bayes is also efficient.
- This asymptotic robustness can be used as a response to the criticism that Bayesian analysis is essentially subjective (depends on the prior).