
Introduction to Bayesian computation
Bayesian Statistics, Statistics 4224/5224, Spring 2021
February 23, 2021
1

Agenda
1. Introduction to Bayesian computation (Chapter 10 of Bayesian Data Analysis, by Andrew Gelman et al)
2. Finding posterior modes (Section 13.1 of Gelman et al; also section 3.1.1 of Bayesian Statistical Methods, by Reich and Ghosh)
3. Numerical integration (Section 10.1 of Gelman et al, also section 3.1.2 of Reich and Ghosh)
4. Normal approximations to the posterior distribution (Section 4.1 of Gelman et al, also section 3.1.3 of Reich and Ghosh)
2

Introduction to Bayesian computation
Bayesian computation revolves around two steps: computation of the posterior distribution p(θ|y), and computation of the posterior predictive distribution p(ỹ|y).
So far we have considered examples where these could be computed analytically in closed form, with simulations performed directly using a combination of preprogrammed routines for standard distributions and numerical computation on grids.
For complicated or unusual models or in high dimensions, however, more elaborate algorithms are required to approximate the posterior distribution.
3

Normalized and unnormalized densities
We refer to the (multivariate) distribution to be simulated as the target distribution and denote it p(θ|y).
We assume that p(θ|y) can be easily computed for any value θ, up to a factor involving only the data y.
That is, we assume there is some easily computable function q(θ|y), an unnormalized density, for which q(θ|y)/p(θ|y) is a constant that depends only on y.
For example, we work with the product p(θ)p(y|θ), which is proportional to the posterior density.
4

Log densities
To avoid computational overflows and underflows, one should work with logarithms of posterior densities whenever possible.
Exponentiation should be performed only when necessary, and as late as possible.
For example, the ratio of two densities should be computed as the exponential of the difference of the log-densities.
p(θ∗|y)/p(θ|y) = q(θ∗|y)/q(θ|y) = exp{log q(θ∗|y) − log q(θ|y)}
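For instance, in R (a toy sketch, with a standard normal density standing in for the unnormalized posterior q(θ|y)):

theta_star <- 1.5
theta      <- 0.5
log_q_star <- dnorm(theta_star, log = TRUE)   # log q(theta*|y), computed on the log scale
log_q      <- dnorm(theta, log = TRUE)        # log q(theta|y)
ratio <- exp(log_q_star - log_q)              # exponentiate only at the last step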
5

Finding posterior modes
A common point estimator is the maximum a posteriori (MAP) estimator, defined as the value that maximizes the posterior (i.e., the posterior mode),
θ̂_MAP = arg max_θ log[p(θ|y)] = arg max_θ {log[p(θ)] + log[p(y|θ)]}
The MAP estimator requires optimization of the posterior, unlike the posterior mean which requires integration.
The MAP solution can be found using calculus or numerical optimization.
6

Newton’s method
Newton’s method, also called the Newton-Raphson algorithm, is an iterative approach based on a quadratic Taylor series approximation of the log posterior density,
L(θ) = log p(θ|y) .
It is also acceptable to use an unnormalized posterior density.
First, one must choose a starting value θ0. Then, for t = 1, 2, 3, …:
7

1. Compute L′(θt−1) and L′′(θt−1). The Newton-Raphson step at time t is based on the quadratic approximation to L(θ) centered at θt−1.
2. Set the new iterate, θt, to maximize the quadratic approximation; thus
θt = θt−1 − [L′′(θt−1)]−1L′(θt−1) .
This step is repeated until convergence, i.e., θt ≈ θt−1.
The starting point, θ0, is important: the algorithm is not guaranteed to converge from all starting values, particularly in regions where −L′′ is not positive definite.
Starting values may be obtained from crude parameter estimates.
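A minimal one-parameter R sketch of this iteration (newton_map is a hypothetical helper; the user supplies the first and second derivatives of the log posterior, here illustrated with a toy quadratic log posterior):

newton_map <- function(theta0, d1, d2, tol = 1e-8, maxit = 100) {
  theta <- theta0                              # starting value theta0
  for (t in 1:maxit) {
    theta_new <- theta - d1(theta) / d2(theta) # theta_t = theta_{t-1} - L'(theta_{t-1}) / L''(theta_{t-1})
    if (abs(theta_new - theta) < tol) break    # stop once theta_t is approximately theta_{t-1}
    theta <- theta_new
  }
  theta_new
}
# Toy example: L(theta) = -theta^2/2, so L'(theta) = -theta and L''(theta) = -1
newton_map(5, d1 = function(th) -th, d2 = function(th) -1)   # converges to the mode, 0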

Quasi-Newton methods
Computation and storage of −L′′ in Newton’s method may be costly.
Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method form an approximation of −L′′ iteratively using only the gradient information.
R has many optimization routines, including the general-purpose algorithm optim.
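A minimal sketch of the usual call pattern: define the negative log posterior and minimize it with optim() using method = "BFGS" (here a Normal(2, 1) log density stands in for the log posterior, purely for illustration):

neg_log_post <- function(theta) -dnorm(theta, mean = 2, sd = 1, log = TRUE)  # toy target
fit <- optim(par = 0, fn = neg_log_post, method = "BFGS", hessian = TRUE)
fit$par       # posterior mode (MAP estimate), here approximately 2
fit$hessian   # numerically differentiated -L'' at the mode (the observed information)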
9

Example
(Figure 3.1 in Reich and Ghosh)
Likelihood: Let yi|μ, σ ∼ iid Normal(μ, σ²) for i = 1, . . . , n.
Priors: μ ∼ Normal(0, 100²) and σ ∼ Uniform(0, 10).
Data: n = 5 and y = (2.68, 1.19, −0.97, −0.98, −1.03).
R code:
Courseworks → Files → Examples → Example10a 13
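A hedged sketch of what such code might look like (it is not the posted Example10a itself): write the log posterior for (μ, σ) and maximize it with optim(), using box constraints to keep σ inside (0, 10).

y <- c(2.68, 1.19, -0.97, -0.98, -1.03)

log_post <- function(par) {
  mu <- par[1]; sigma <- par[2]
  sum(dnorm(y, mu, sigma, log = TRUE)) +   # log likelihood
    dnorm(mu, 0, 100, log = TRUE) +        # log prior: mu ~ Normal(0, 100^2)
    dunif(sigma, 0, 10, log = TRUE)        # log prior: sigma ~ Uniform(0, 10)
}

fit <- optim(par = c(mean(y), sd(y)),      # crude starting estimates
             fn = function(par) -log_post(par),
             method = "L-BFGS-B",
             lower = c(-Inf, 0.01), upper = c(Inf, 10))
fit$par                                    # MAP estimate of (mu, sigma)

The lower bound 0.01 on σ is an arbitrary small value that keeps the optimizer away from σ = 0.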
10

Numerical integration
Numerical integration, also called ‘quadrature,’ refers to methods in which the integral over continuous functions is evaluated by computing the value of the function at a finite number of points.
By increasing the number of points where the function is evaluated, desired accuracy can be obtained.
Numerical integration methods can be divided into stochastic methods (simulation, aka Monte Carlo) and deterministic methods (quadrature).
The posterior expected value of any function h(θ) is defined as

E[h(θ)|y] = ∫ h(θ) p(θ|y) dθ ,

where the integral has as many dimensions as θ.
11

If we have posterior draws θˢ from p(θ|y), we can estimate the integral by the sample average of the h(θˢ) for s = 1, . . . , S.
Stochastic methods
Stochastic methods, aka simulation, aka Monte Carlo integration, are based on obtaining random samples θˢ from the desired distribution p(θ|y), and estimating the expectation of any function h(θ) by

E[h(θ)|y] = ∫ h(θ) p(θ|y) dθ ≈ (1/S) ∑_{s=1}^{S} h(θˢ) .

The estimate is stochastic, depending on generated random numbers, but the accuracy of the simulation can be improved by obtaining more samples.
12
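As a minimal R illustration, suppose (purely for the sake of the sketch) that the posterior happens to be a Beta(3, 7) distribution we can sample from directly:

S <- 10000
theta_draws <- rbeta(S, 3, 7)   # S draws theta_s from p(theta|y)
h <- function(theta) theta^2    # any function of interest
mean(h(theta_draws))            # sample average of h(theta_s) estimates E[h(theta)|y]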

Deterministic methods
Deterministic numerical integration methods are based on evaluating the integrand h(θ)p(θ|y) at selected points θˢ, based on a weighted version of the previous approximation,

E[h(θ)|y] = ∫ h(θ) p(θ|y) dθ ≈ (1/S) ∑_{s=1}^{S} w_s h(θˢ) p(θˢ|y) ,

with weights w_s corresponding to the volume of space represented by the point θˢ.
The simplest deterministic method is to evaluate the integrand in a grid with equal weights.
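A minimal R sketch of an equally spaced grid approximation, using the same toy Beta(3, 7) posterior for comparison; dividing by the sum in the last line also handles the normalizing constant, so an unnormalized q(θ|y) could be used instead:

delta <- 0.001
grid  <- seq(delta / 2, 1 - delta / 2, by = delta)  # grid of theta values, equal cell widths
q     <- dbeta(grid, 3, 7)                          # posterior density evaluated on the grid
h     <- function(theta) theta^2
sum(delta * h(grid) * q) / sum(delta * q)           # grid approximation to E[h(theta)|y]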
Example
Continue with the nonconjugate normal model in ‘Example 10a’.
13

Distributional approximations
Distributional (analytic) approximations approximate the posterior with some simpler parametric distribution, from which integrals can be computed directly or by using the approximation as a starting point for simulation-based methods.
Normal approximation to the posterior distribution
If the posterior distribution p(θ|y) is unimodal and roughly symmetric, it can be convenient to approximate it by a normal distribution; that is, the log of the posterior density is approximated by a quadratic function of θ.
14

Bayesian central limit theorem (CLT)
A Taylor series expansion of log p(θ|y), centered at the posterior mode θ̂ = θ̂_MAP, gives

log p(θ|y) ≈ log p(θ̂|y) + (1/2)(θ − θ̂)ᵀ [d²/dθ² log p(θ|y)]_{θ=θ̂} (θ − θ̂) ,

where the linear term in the expansion is zero because the log-posterior density has zero derivative at its mode.
The remainder terms of higher order fade in importance relative to the quadratic term when θ is close to θˆ and n is large.
15

Since the first term is a constant, and the second term is proportional to the log of a normal density, we have the approximate result

θ|y ≈ Normal(θ̂_MAP, [I(θ̂_MAP)]⁻¹) ,

where I(θ) is the observed information,

I(θ) = −d²/dθ² log p(θ|y) .
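Numerically, the same approximation can be formed without any calculus: minimize the negative log posterior with optim(..., hessian = TRUE), so the returned Hessian estimates I(θ̂_MAP) and its inverse the approximate posterior covariance. A sketch, reusing y and log_post() from the normal-model sketch earlier:

fit <- optim(par = c(mean(y), sd(y)),
             fn = function(par) -log_post(par),
             method = "L-BFGS-B",
             lower = c(-Inf, 0.01), upper = c(Inf, 10),
             hessian = TRUE)
post_mode <- fit$par             # (mu, sigma) at the joint posterior mode
post_cov  <- solve(fit$hessian)  # approximate posterior covariance, [I(theta_MAP)]^{-1}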
16

Example
Let y|θ ∼ Binomial(n, θ) where θ ∼ Uniform(0, 1). Then

p(θ|y) ∝ p(θ) p(y|θ) ∝ θ^y (1 − θ)^(n−y)

and thus

log p(θ|y) = y log θ + (n − y) log(1 − θ) ,

which is maximized at θ̂_MAP = y/n.
This maximizer gives the mean for the normal approximation to the posterior distribution.
17

The variance is given by [I(θ̂_MAP)]⁻¹, where

I(θ) = −d²/dθ² log p(θ|y) = −d/dθ [y/θ − (n − y)/(1 − θ)] = y/θ² + (n − y)/(1 − θ)²
The R code at
Courseworks → Files → Examples → Example04
demonstrates the normal distribution providing successively better approximations to the true posterior distribution as n increases (Figure 3.2 in Reich and Ghosh).
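A minimal sketch of that kind of comparison (it is not the posted Example04 itself), for one hypothetical choice of n and y:

n <- 20; y <- 12                 # hypothetical data
theta_hat <- y / n               # posterior mode under the Uniform(0, 1) prior
info <- y / theta_hat^2 + (n - y) / (1 - theta_hat)^2   # observed information I(theta_hat)

theta <- seq(0.001, 0.999, length.out = 500)
plot(theta, dbeta(theta, y + 1, n - y + 1), type = "l",
     ylab = "posterior density")                        # exact Beta(y+1, n-y+1) posterior
lines(theta, dnorm(theta, theta_hat, sqrt(1 / info)),
      lty = 2)                                          # normal approximation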
18