Gamma-Poisson Model and Posterior Inference.
Elena Moltchanova
STAT314/461-2021S1
Sandwiches:
Table 1: Sandwich Demand
Day         x
Monday     50
Tuesday    65
Wednesday  72
Thursday   63
Friday     70
How can we estimate the average daily demand for sandwiches and provide uncertainty bounds for it?
Poisson Likelihood
Let $x_i$ denote the number of sandwiches sold on day $i$. Then
$$x_i \mid \mu \sim \text{Pois}(\mu), \quad i = 1, \ldots, n,$$
where $n$ is the total number of recorded days and $\mu$ is the Poisson intensity parameter, corresponding to the expected number of sandwiches sold on any given day.
Classical Analysis: Likelihood
$$f(x_i \mid \mu) = \frac{\mu^{x_i} e^{-\mu}}{x_i!}.$$
Classical Analysis: Joint Likelihood
$$f(\mathbf{x} \mid \mu) = f(x_1, \ldots, x_n \mid \mu) = \prod_i f(x_i \mid \mu) = \prod_i \frac{\mu^{x_i} e^{-\mu}}{x_i!} = \frac{\mu^{\sum_i x_i}\, e^{-n\mu}}{\prod_i x_i!}.$$
Maximizing the likelihood:
- Log the likelihood.
- Find the first derivative w.r.t. $\mu$, set it equal to zero, and solve. Call the solution $\hat\mu$.
- Find the second derivative and check that it is negative at $\hat\mu$ (confirming a maximum).
- The solution $\hat\mu$ is the Maximum Likelihood Estimate (MLE), as worked out below.
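Following these steps for the Poisson likelihood (a worked sketch):
$$\ell(\mu) = \log f(\mathbf{x} \mid \mu) = \Big(\sum_i x_i\Big) \log \mu - n\mu - \sum_i \log x_i!,$$
$$\ell'(\mu) = \frac{\sum_i x_i}{\mu} - n = 0 \;\Rightarrow\; \hat\mu = \frac{\sum_i x_i}{n} = \bar{x}, \qquad \ell''(\mu) = -\frac{\sum_i x_i}{\mu^2} < 0.$$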
Maximizing the likelihood:
Poisson MLE:
$$\hat\mu = \bar{x} = 64.$$
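As a quick numerical check (a minimal sketch; x holds the counts from Table 1):

x <- c(50, 65, 72, 63, 70)
mean(x)

## [1] 64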
Bayesian Derivation: What is the prior?
The Poisson intensity parameter $\mu$ is positive. A convenient choice of prior is
$$\mu \mid a, b \sim \text{Gamma}(a, b),$$
i.e.,
$$f(\mu \mid a, b) = \frac{b^a}{\Gamma(a)}\, \mu^{a-1} e^{-b\mu} \quad \text{for } \mu > 0,$$
where $\Gamma(a)$ is the gamma function.
Applying Bayes’ Theorem
Posterior distribution:
$$\mu \mid \mathbf{x}, a, b \sim \text{Gamma}\Big(a + \sum_i x_i,\; b + n\Big).$$
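The derivation takes one line: multiplying the likelihood by the prior and keeping only the factors involving $\mu$,
$$f(\mu \mid \mathbf{x}, a, b) \propto \mu^{\sum_i x_i} e^{-n\mu} \cdot \mu^{a-1} e^{-b\mu} = \mu^{a + \sum_i x_i - 1}\, e^{-(b + n)\mu},$$
which is the kernel of the $\text{Gamma}\big(a + \sum_i x_i,\, b + n\big)$ density.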
Conjugate posterior distributions:
In the previous derivation, we combined a gamma prior with a
Poisson likelihood to obtain a gamma posterior distribution.
Therefore, the gamma distribution is a conjugate prior for the Poisson likelihood.
Conjugate priors are often selected for convenience, although with
the advance of numerical methods and computer technology they
have become less important.
Gamma distributions: Gamma(a, b)
[Figure: densities $f(\mu)$ of four priors — Gamma(3, 0.02), Gamma(70, 1), Gamma(0.01, 0.01), and Gamma(700, 10) — plotted over $\mu \in (0, 125)$; two panels with different vertical scales.]
For example, for the Gamma(70, 1) prior:
$$E(\mu \mid a, b) = \frac{a}{b} = \frac{70}{1} = 70, \qquad \text{Var}(\mu \mid a, b) = \frac{a}{b^2} = \frac{70}{1^2} = 70.$$
Quantiles (here for the Gamma(700, 10) prior):
qgamma(0.025,700,10)
## [1] 64.90978
qgamma(0.975,700,10)
## [1] 75.27963
Figuring out the Prior:
“My sandwich shop sells about $50 \pm 2$ sandwiches daily.”

The width of the interval that supports 95% of the probability mass is $\approx 4\,\text{sd}$, so here $\text{sd} \approx 1$.

Find $a$ and $b$ such that
$$\frac{a}{b} = 50 \quad \text{and} \quad \frac{a}{b^2} = 1,$$
i.e., $a = 2500$ and $b = 50$.
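A quick numerical check of this prior (a sketch; the quantiles below are approximate):

a <- 2500; b <- 50
a / b                           # prior mean: 50
sqrt(a / b^2)                   # prior sd: 1
qgamma(c(0.025, 0.975), a, b)   # approximately (48.0, 52.0), i.e. 50 +/- 2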
Plotting the Prior (Just to Check).
[Figure: the Gamma(2500, 50) prior density $f(\mu)$ over $\mu \in (40, 60)$, concentrated around 50.]
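One way to produce such a check in R (a minimal sketch):

curve(dgamma(x, 2500, 50), from = 40, to = 60,
      xlab = expression(mu), ylab = expression(f(mu)))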
Back to sandwiches: Posterior distribution
$$\mu \mid \mathbf{x}, a, b \sim \text{Gamma}\Big(a + \sum_i x_i,\; b + n\Big),$$
i.e., with the Gamma(700, 10) prior, $\sum_i x_i = 320$ and $n = 5$:
$$\mu \mid \mathbf{x}, a, b \sim \text{Gamma}(1020, 15).$$
Summarizing the posterior distribution:
Posterior mean and variance:
$$E(\mu \mid \mathbf{x}, a, b) = \frac{1020}{15} = 68, \qquad \text{Var}(\mu \mid \mathbf{x}, a, b) = \frac{1020}{15^2} \approx 4.53 \approx 2.13^2.$$
And the posterior 95% credible interval: the interval that supports 95% of the posterior density, i.e., contains the true value of the parameter with probability 0.95 (contrast this with the classical definition of a confidence interval).
qgamma(c(0.025, 0.975), 1020, 15)
## [1] 63.89042 72.23586
Writing the report:
Based on our analysis, the posterior expected mean daily demand is 68 sandwiches (95% CI: 63.9, 72.2).
Note that if you were doing it the classical way, employing the Central Limit Theorem and the Poisson assumption ($E(x) = \text{Var}(x) = \mu$), you would obtain the mean demand estimate $\hat\mu = \bar{x} = 64$ (s.e. $= \sqrt{\bar{x}/n} \approx 3.6$) and the 95% confidence interval $64 \pm 1.96 \times 3.6$, i.e., $(57.0, 71.0)$.
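The classical interval can be computed in the same way (a sketch, reusing x from above):

mean(x) + c(-1.96, 1.96) * sqrt(mean(x) / length(x))   # approximately (57.0, 71.0)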
Summary (so far):
- The object of Bayesian inference is the posterior distribution of the parameter of interest.
- The posterior distribution is obtained by combining the data-generating mechanism (likelihood) with the prior information.
- The posterior distribution can be updated continuously, one observation or one batch of observations at a time.
- The posterior distribution can be summarised via the posterior mean, variance, and the Bayesian 95% credible interval.
- The final result will depend on the prior to some extent.
Is Bayesian statistics subjective?
. . . and what about classical statistics?
Different priors . . .
[Figure: the four prior densities $f(\mu)$ — Gamma(3, 0.02), Gamma(70, 1), Gamma(0.01, 0.01), Gamma(700, 10) — over $\mu \in (0, 125)$.]
. . . Different posteriors
[Figure: the corresponding posterior densities $f(\mu \mid \mathbf{x})$ for the four priors, over $\mu \in (0, 125)$.]
. . . Different posteriors -2
[Figure: the four posterior densities $f(\mu \mid \mathbf{x})$, zoomed in to $\mu \in (50, 80)$.]
. . . Different posteriors – 3
Prior               a     b      Post. mean  Post. var  2.5%   97.5%
Gamma(3, 0.02)      3     0.02   64.34       12.8173    57.52  71.55
Gamma(70, 1)        70    1.00   65.00       10.8333    58.71  71.61
Gamma(0.01, 0.01)   0.01  0.01   63.87       12.7494    57.07  71.06
Gamma(700, 10)      700   10.00  68.00       4.5333     63.89  72.24
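These summaries are straightforward to reproduce (a sketch, reusing x from above):

a <- c(3, 70, 0.01, 700)
b <- c(0.02, 1, 0.01, 10)
post.a <- a + sum(x)
post.b <- b + length(x)
cbind(post.mean = post.a / post.b,
      post.var  = post.a / post.b^2,
      lo = qgamma(0.025, post.a, post.b),
      hi = qgamma(0.975, post.a, post.b))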
Another look at the posterior distribution:
$$\mu \mid \mathbf{x}, a, b \sim \text{Gamma}\Big(a + \sum_i x_i,\; b + n\Big).$$
So the posterior mean is
$$E(\mu \mid \mathbf{x}, a, b) = \frac{a + \sum_i x_i}{b + n} = \frac{a + n\bar{x}}{b + n}.$$
As more and more data come in, and the sample size becomes larger and larger,
$$\lim_{n \to \infty} E(\mu \mid \mathbf{x}, a, b) = \lim_{n \to \infty} \frac{a + n\bar{x}}{b + n} = \bar{x}.$$
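A quick numerical illustration with the Gamma(700, 10) prior and $\bar{x} = 64$ (a sketch):

n <- c(5, 50, 500, 5000)
(700 + n * 64) / (10 + n)

## [1] 68.00000 65.00000 64.11765 64.01198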
So. . .
- there is a bit of subjectivity in any data analysis,
- the choice of priors matters,
- but it matters less when there is a lot of data.
A restrictive prior:
Recall Bayes' Formula:
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{\int f(x \mid \theta)\, f(\theta)\, d\theta} \propto f(x \mid \theta)\, f(\theta),$$
and consider the following prior density:
$$f(\theta) = \begin{cases} 1/(b - a) & \text{for } \theta \in (a, b) \\ 0 & \text{otherwise}, \end{cases}$$
for some $a < b$.

A limited posterior:
$$f(\theta \mid x) \propto \begin{cases} f(x \mid \theta) & \text{for } \theta \in (a, b) \\ 0 & \text{otherwise}. \end{cases}$$

Central Credible Interval vs. Highest Density Region.
- The central credible interval is delimited by the 2.5% quantile from below and the 97.5% quantile from above.
- The highest density interval is the smallest interval (or the union of intervals with the shortest total length) supporting 95% of the probability mass.

Gamma(20,1)
[Figure: the Gamma(20, 1) density $f(\theta)$ over $\theta \in (10, 40)$.]

Gamma(20,1)
- 95% (central) CI: $29.67 - 12.22 = 17.45$.
- An interval between the 1.4% quantile and the 96.4% quantile: $28.75 - 11.47 = 17.28$.
- The narrowest 95% interval, between the 1.63% quantile and the 96.63% quantile: $28.92 - 11.65 = 17.26$.

Beta(0.5,0.5) - 1
[Figure: the Beta(0.5, 0.5) density over $x \in (0, 1)$, vertical scale 0 to 3.]

Beta(0.5,0.5) - 2
[Figure: the Beta(0.5, 0.5) density over $\theta \in (0, 1)$, vertical scale 0 to 8; the density is U-shaped, so its 95% highest density region is a union of two intervals.]

Summary so far. . .
- There are many choices to be made in data analysis. They are subjective to some extent, but that is OK as long as these choices are clearly presented and explained.
- In general, the more data you have, the less the choice of prior will matter. In many cases, the larger your sample, the closer your posterior mean will be to the classical maximum likelihood estimate.
- You can use a non-informative (also called vague or weak) prior distribution if you do not have much information about the phenomenon a priori.
- You can use strong priors if you have a good reason for it and explain it well.
- The prior distribution has to be defined before you collect the data. Don't use your data to define your prior!
- Sensitivity analysis can be performed to see how influential the choice of prior is.
- There are two types of credible intervals: central and highest density. For unimodal symmetric distributions, they will be one and the same. Central CIs are often easier to compute, and they are commonly used by default.

Posterior predictive distribution
Let $\tilde{x}$ denote a new, as yet unobserved, datum (demand for sandwiches tomorrow, say). Then, given the data $\mathbf{x} = \{x_1, \ldots, x_n\}$ observed so far,
$$f(\tilde{x} \mid \mathbf{x}) = \int f(\tilde{x} \mid \mu)\, f(\mu \mid \mathbf{x})\, d\mu.$$
The distribution of $\tilde{x}$ given the data $\mathbf{x}$ is called the posterior predictive distribution: you are predicting a future observation after observing some data. Note that this is different from the posterior distribution of the parameter $\mu$.

Hypothesis Testing:
Is the average demand on Tuesday higher than the average demand on Monday?

Day-specific posterior distribution:
(Each day is treated separately: the Gamma(700, 10) prior is combined with that day's single observation, so post.a = 700 + x and post.b = 10 + 1 = 11.)

Table 3: Sandwich Demand

Day         x   post.a  post.b  post.mean  post.lo  post.hi
Monday     50   750     11      68.2       63.4     73.1
Tuesday    65   765     11      69.5       64.7     74.6
Wednesday  72   772     11      70.2       65.3     75.2
Thursday   63   763     11      69.4       64.5     74.4
Friday     70   770     11      70.0       65.1     75.0

Answering the research questions:
Based on the observations so far, which day of the week has the highest average demand?

Summary so far
- It is a common mistake to think about statistics in terms of applying a method to the data.
- Instead, you need to think about answering the actual research question (and figuring out what data and what methods would get you there).
- The posterior distribution concerns the parameter of interest, such as the expected Poisson intensity $\mu \mid \mathbf{x}$. The posterior predictive distribution concerns new observations $\tilde{x} \mid \mathbf{x}$. You may think of the expected demand as a long-term average, and of a new observation as the demand on any given day. The latter will be much more variable.
- Simulations and manipulations of samples from the posterior distribution are an important part of making Bayesian statistics truly versatile and flexible.
- It is customary to derive posterior Bayesian probabilities to support various statements; these are sometimes called Bayesian P-values. Keep in mind, however, that the interpretation of Bayesian posterior probabilities is completely different from the interpretation of frequentist p-values.
- In particular, in Bayesian inference, a higher posterior probability means that the statement is better supported by the data, whereas in frequentist inference, p < 0.05 is usually taken as evidence of strong support for the statement (formally, strong evidence against the null hypothesis).
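For illustration, a minimal simulation sketch of the Tuesday-vs-Monday question, using the day-specific posteriors from Table 3 and the Gamma(1020, 15) posterior from earlier (the seed and sample size are arbitrary choices):

set.seed(314)
mu.mon <- rgamma(1e5, 750, 11)   # posterior draws for Monday's mean demand
mu.tue <- rgamma(1e5, 765, 11)   # posterior draws for Tuesday's mean demand
mean(mu.tue > mu.mon)            # P(Tuesday > Monday | data), approximately 0.65

x.new <- rpois(1e5, rgamma(1e5, 1020, 15))   # posterior predictive draws
quantile(x.new, c(0.025, 0.975))             # wider than the interval for mu itself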