
Bayesian methods
(Module 9)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
School of Mathematics and Statistics University of Melbourne


Semester 2, 2022

Aims of this module
• Explain two different ways to use probability for modelling
• Introduce the Bayesian approach to statistical inference
• Review the probability tools required to carry this out
• Show examples of Bayesian inference for simple models
• Discuss how to choose an appropriate prior
• Compare and contrast Bayesian & classical inference

Review of probability
  Interpretations of probability
Bayesian inference: an introduction
  The Bayesian ‘recipe’
  Using the posterior
Bayesian inference: further examples
  Normal
  Binomial
  Other
Prior distributions
Comparing Bayesian & classical inference

From our last lecture. . .
• Disease testing example
• Tree diagrams

Review some probability definitions
• Let A and B be two events
• Often these are in terms of random variables, e.g. A = ‘X = 3’
• Joint probability:
  Pr(A, B) = Pr(A ∩ B) = Pr(A and B both occur)
• Marginal probability:
  Pr(A) = Pr(A occurs irrespective of B) = Pr(A, B) + Pr(A, B̄)
• Conditional probability:
  Pr(A | B) = Pr(A occurs given that B occurs) = Pr(A, B) / Pr(B)

Bayes’ theorem
Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
The denominator can be written out using:
Pr(A) = Pr(A, B) + Pr(A, B̄)
      = Pr(A | B) Pr(B) + Pr(A | B̄) Pr(B̄)
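As a sketch, Bayes' theorem with this expanded denominator can be checked numerically. The numbers below (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical, chosen in the spirit of the disease-testing example from the previous lecture:

```python
from fractions import Fraction

# Hypothetical disease-testing numbers (not from the slides):
# B = 'has disease', A = 'test is positive'
pr_B = Fraction(1, 100)              # prevalence, Pr(B)
pr_A_given_B = Fraction(95, 100)     # sensitivity, Pr(A | B)
pr_A_given_notB = Fraction(5, 100)   # false-positive rate, Pr(A | B̄)

# Denominator via the law of total probability
pr_A = pr_A_given_B * pr_B + pr_A_given_notB * (1 - pr_B)

# Bayes' theorem: Pr(B | A)
pr_B_given_A = pr_A_given_B * pr_B / pr_A
print(float(pr_B_given_A))  # ≈ 0.161: most positives are false positives
```

Even with an accurate test, a rare condition means a positive result is more likely to be a false positive than a true one.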

Partitions
• Let B1,B2,…,Bk be a partition of the sample space
• This ’splits up’ the sample space into distinct events
• More precisely, the events cover the whole sample space (B1 ∪ B2 ∪ ··· ∪ Bk = Ω) and are mutually exclusive (Bi ∩ Bj = ∅ when i ≠ j).
• Example: roll a die and let the outcome be X, the events B1 = ‘X is even’ and B2 = ‘X is odd’ form a partition.
• The law of total probability relates marginal and conditional probabilities,
Pr(A) = Σi Pr(A, Bi) = Σi Pr(A | Bi) Pr(Bi)

Bayes’ theorem again
Pr(Bi | A) = Pr(A | Bi) Pr(Bi) / Σj Pr(A | Bj) Pr(Bj)
Sometimes write this more compactly as:
Pr(Bi | A) ∝ Pr(A | Bi) Pr(Bi)

Continuous random variables
Analogous definitions in terms of density functions (for rvs X and Y):
• Joint pdf: f(x, y)
• Marginal pdf (law of total probability):
  f(x) = ∫ f(x, y) dy = ∫ f(x | y) f(y) dy   (integrating over the whole real line)
• Conditional pdf:
  f(x | y) = f(x | Y = y) = f(x, y) / f(y)
• Bayes’ theorem:
  f(x | y) = f(y | x) f(x) / f(y)

Review of probability
  Interpretations of probability
Bayesian inference: an introduction
  The Bayesian ‘recipe’
  Using the posterior
Bayesian inference: further examples
  Normal
  Binomial
  Other
Prior distributions
Comparing Bayesian & classical inference

How do we use probability?
• Modelling variation (frequentist probability)
• Representing uncertainty (Bayesian probability)
Classical inference only uses frequentist probability.
Bayesian inference uses both.

Frequentist probability
• The relative frequency of occurrence in the long run, under hypothetical repetitions of an experiment
• This is what we usually have in mind when devising a statistical model for the data
• Example: X ∼ N(μ, σ2), specifies a model for variation across multiple observations of X
• Known as frequentist probability
• Also known as aleatory, physical or frequency probability
• Needs a well-defined random experiment / repetition mechanism
• The interpretation for one-off events, and those that have already occurred, is problematic (recall the ‘card trick’)

Bayesian probability
• The degree of plausibility, or strength of belief, of a given statement based on existing knowledge and evidence, expressed as a probability
• Known as Bayesian probability
• Also known as epistemic or evidential probability
• Can be assigned to any statement, even when no random process is involved, and irrespective of whether the event has yet occurred or not
• Example: what is the probability the dinosaurs were wiped out by an asteroid?
• Popularly expressed in terms of betting: if you were forced to make a bet on the outcome, what odds would you accept?

• Probability also has a mathematical definition, in terms of axioms. This is separate to its interpretation as a model of reality.
• When using mathematical probability, it is not self-evident that the ‘long-run relative frequency’ actually exists and is equal to the underlying probability you start with as part of the axioms; this is something that needs to be proved. It turns out to be true and this fact is known as the Law of Large Numbers.
• Most people only learn about the frequentist notion of probability. However, in practice they often naturally use the Bayesian notion, as the card trick demonstrated. They do so without necessarily knowing about the different notions of probability, which can sometimes lead to confusion.

Why use Bayesian probability?
• We do it naturally. Card trick, gambling odds,. . .
• Asking the right question. Allows us to directly answer the
question of interest
• Going beyond true/false. Can be viewed as an extension of formal logic that allows reasoning under uncertainty

Review of probability
  Interpretations of probability
Bayesian inference: an introduction
  The Bayesian ‘recipe’
  Using the posterior
Bayesian inference: further examples
  Normal
  Binomial
  Other
Prior distributions
Comparing Bayesian & classical inference

The elements of Bayesian inference
• Take our existing statistical models and add:
◦ Parameters & hypotheses are modelled as random variables
• In other words:
◦ Parameters will have probability distributions
◦ Hypotheses will have probabilities
• These are Bayesian probabilities
• They quantify and express our uncertainty, both before (‘prior’)
and after (‘posterior’) seeing any data
• Requires the use of Bayes’ theorem

Motivating example
• A coin is either fair or unfair:
  θ = Pr(heads) = 0.5 if the coin is fair, 0.7 if it is unfair
• Flip the coin 20 times
• The number of heads is X ∼ Bi(20, θ)
• In light of the data, what can we say about whether the coin is fair?
• What does X tell us about θ?

Posterior distribution
• Goal: calculate Pr(coin is fair | X) = Pr(θ = 0.5 | X)
• More broadly, Pr(parameter or hypothesis | data)
• This is known as the posterior distribution (or just the posterior)
• Quantifies our knowledge in light of the data we observe
• Posterior means ‘coming after’ in Latin
• In Bayesian inference, the posterior distribution summarises all of the information about the parameters of interest

Calculating the posterior
• Use Bayes’ theorem,
Pr(θ = 0.5 | X = x) = Pr(X = x | θ = 0.5) Pr(θ = 0.5) / Pr(X = x)
• The denominator is (law of total probability),
Pr(X = x) = Pr(X = x | θ = 0.5) Pr(θ = 0.5) + Pr(X = x | θ = 0.7) Pr(θ = 0.7)
• We need to specify:
◦ The likelihood, Pr(X | θ).
◦ The prior distribution (or just the prior), Pr(θ)
• In our example, the likelihood is a binomial distribution

Specifying the prior
• Also need a prior to get the whole thing off the ground
• Prior means ‘before’ in Latin
• Specifying an appropriate prior requires some thought (more details later)
• For now, let’s assume either outcome is equally plausible, Pr(fair coin) = Pr(unfair coin) = 0.5
Pr(θ = 0.5) = Pr(θ = 0.7) = 0.5

Putting it together
• This gives,
  Pr(θ = 0.5 | X = x) = Pr(X = x | θ = 0.5) / [Pr(X = x | θ = 0.5) + Pr(X = x | θ = 0.7)]
• For example,
  Pr(θ = 0.5 | X = 15) =
  Pr(θ = 0.5 | X = 10) =
  Pr(θ = 0.5 | X = 5) =
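The posterior probabilities left blank above can be computed directly from this formula. A minimal sketch in Python (note the binomial coefficient cancels between numerator and denominator, so it can be omitted):

```python
n = 20

def posterior_fair(x):
    """Pr(theta = 0.5 | X = x) under the 50/50 prior.

    The binomial coefficient C(n, x) appears in every term and cancels.
    """
    like_fair = 0.5**x * 0.5**(n - x)
    like_unfair = 0.7**x * 0.3**(n - x)
    return like_fair / (like_fair + like_unfair)

for x in (15, 10, 5):
    print(x, round(posterior_fair(x), 3))  # ≈ 0.076, 0.851, 0.997 respectively
```

So seeing 15 heads makes the fair-coin hypothesis quite implausible, while 10 or fewer heads strongly favours it.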

Example (card experiment)
• Select 5 cards at random (don’t look at them!)
• Sample from these n times with replacement
• Let X be the number of times you see a red card
• Likelihood: X ∼ Bi(n, θ)
• θ ∈ {0, 1/5, 2/5, 3/5, 4/5, 1}
• Use a uniform prior again,
Pr(θ = a) = 1/6   (for all a)

• Calculate posterior,
Pr(θ = a | X = x) = Pr(X = x | θ = a) Pr(θ = a) / Pr(X = x)
• The denominator is always just a sum/integral of the numerator,
Pr(X = x) = Σb Pr(X = x | θ = b) Pr(θ = b)
• For convenience, we often omit it,
Pr(θ = a | X = x) ∝ Pr(X = x | θ = a) Pr(θ = a)

• This gives,
Pr(θ = a | X = x) ∝ (n choose x) a^x (1 − a)^(n−x) × (1/6)
                  ∝ a^x (1 − a)^(n−x)
• Only need the terms that refer to the parameter values, a
• Now try it out…
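A sketch of ‘trying it out’ in Python, using a hypothetical outcome of x = 7 red cards in n = 10 draws (these particular numbers are illustrative, not from the slides):

```python
from fractions import Fraction

# Hypothetical data for illustration: n = 10 draws, x = 7 red
n, x = 10, 7
thetas = [Fraction(k, 5) for k in range(6)]  # {0, 1/5, 2/5, 3/5, 4/5, 1}

# Unnormalised posterior: a^x (1 - a)^(n-x); the uniform prior 1/6 cancels
weights = [float(a)**x * float(1 - a)**(n - x) for a in thetas]
total = sum(weights)
posterior = [w / total for w in weights]

for a, p in zip(thetas, posterior):
    print(a, round(p, 3))
```

With 7 reds out of 10, most of the posterior mass lands on θ = 3/5 and θ = 4/5, and the endpoints 0 and 1 are ruled out entirely.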

Example (beta-binomial)
• X∼Bi(n,θ)
• Start with a uniform prior again (now a pdf, since θ is continuous),
  f(θ) = 1,  0 ≤ θ ≤ 1
• Calculate the posterior pdf,
  f(θ | X = x) ∝ Pr(X = x | θ) f(θ) ∝ θ^x (1 − θ)^(n−x)

• Calculate the normalising constant by integrating w.r.t. θ,
  ∫₀¹ θ^x (1 − θ)^(n−x) dθ = ··· = x!(n−x)! / (n+1)!
• The posterior therefore has pdf,
  f(θ | X = x) = [(n+1)! / (x!(n−x)!)] θ^x (1 − θ)^(n−x),  0 ≤ θ ≤ 1
• This is a beta distribution

Beta distribution
• A distribution over the unit interval, p ∈ [0, 1]
• Two parameters: α, β > 0
• Notation: P ∼ Beta(α, β)
• The pdf is:
f(p) = [Γ(α+β) / (Γ(α) Γ(β))] p^(α−1) (1 − p)^(β−1),  0 ≤ p ≤ 1
• Γ is the gamma function, a generalisation of the factorial function. Note that Γ(n) = (n − 1)!
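A quick numerical check of the relationship between the gamma function and the factorial, using only Python's standard library:

```python
from math import gamma, factorial

# Gamma generalises the factorial: Γ(n) = (n - 1)! for positive integers n
for n in range(1, 7):
    assert gamma(n) == factorial(n - 1)

# Non-integer arguments are fine too, e.g. Γ(1/2) = √π
print(gamma(0.5)**2)  # ≈ π
```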

• Properties:
  E(P) = α / (α + β)
  var(P) = αβ / [(α + β)²(α + β + 1)]
  mode(P) = (α − 1) / (α + β − 2)   (for α, β > 1)
• Draw some pdfs to get an idea.

Inference from the posterior
• For our example, the posterior is:
θ | X = x ∼ Beta(x + 1, n − x + 1)
• Could use the posterior mean as a point estimate:
  E(θ | X = x) = (x + 1) / (n + 2)
• More options later. . .

Different priors
• Let’s use a beta distribution as our prior, θ ∼ Beta(α, β) • This gives posterior pdf,
f(θ | X = x) ∝ Pr(X = x | θ)f(θ)
∝ θx(1 − θ)n−x × θα−1(1 − θ)β−1
= θx+α−1(1 − θ)n−x+β−1
• This is again in the form of a beta distribution!
θ | X = x ∼ Beta(x + α, n − x + β)

Conjugate distributions
• Beta prior + binomial likelihood ⇒ beta posterior
• This is a convenient property
• We say that the beta distribution is a conjugate prior for the binomial distribution
• Note: we initially used a uniform prior, which is equivalent to α=β=1

Pseudodata
• Can think of the prior as being equivalent to unobserved data
• It has the same influence on the posterior as an actual sample with
some sample size and particular (pseudo-)observations
• Provides an intuitive interpretation for the prior
• Works particularly well with conjugate priors
• A Beta(1, 1) prior is equivalent to a sample of size 2, with 1 observed success and 1 observed failure.
• A Beta(α, β) prior is equivalent to a sample of size α + β. The parameters α and β are often called pseudocounts.
• Pseudocounts can be non-integer
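The conjugate update and its pseudodata reading can be sketched in a few lines of Python, reusing the earlier n = 20, x = 15 coin data:

```python
# Conjugate update: Beta(alpha, beta) prior + x successes in n trials
# gives Beta(alpha + x, beta + n - x). The prior acts like alpha + beta
# pseudo-observations (alpha successes, beta failures).

def beta_update(alpha, beta, x, n):
    return alpha + x, beta + n - x

# Uniform prior = Beta(1, 1): like having already seen 1 success, 1 failure
a1, b1 = beta_update(1, 1, x=15, n=20)
print(a1, b1)          # Beta(16, 6)
print(a1 / (a1 + b1))  # posterior mean (15 + 1)/(20 + 2) ≈ 0.727
```

The posterior mean treats the real and pseudo observations identically: 16 successes out of 22 ‘trials’ in total.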

• The likelihood is sometimes called the ‘model’. But we sometimes refer to the whole setup (including the prior) as the ‘model’. In any case, we can at least call it the ‘model for the data’.
• Classical inference only works with a likelihood, but entails other choices about how to do inference (see later for a more detailed discussion of the differences between approaches)
• Parameters are modelled as random variables. This expresses our uncertainty of their value. We don’t actually think of them as being truly random quantities, as many textbooks suggest! We still think of them as representing some fixed underlying true value, but one we can never know for certain.

Summarising the posterior
• We’ve worked out the posterior. . . now what?
• Visualise it
• Summarise it
• Think about what you wanted to learn
• What was your original question?

Point estimates
• Can calculate single-number (point) summaries
• Popular choices:
◦ Posterior mean, E(θ | X = x)
◦ Posterior median, median(θ | X = x)
◦ Posterior mode, mode(θ | X = x)
• Uniform prior ⇒ posterior mode = MLE
• The posterior standard deviation, sd(θ | X = x), gives a measure
of uncertainty (analogous to the standard error)
• For example, with n = 20, x = 15 and a uniform prior,
θ | X = 15 ∼ Beta(16, 6)
E(θ | X = 15) = 16/22 = 0.73
sd(θ | X = 15) = √(16·6 / (22²·23)) = 0.093
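These summaries can be checked against a library beta distribution; a sketch using Python's scipy (assuming scipy is available, alongside the R functions used in the slides):

```python
from scipy import stats

# Posterior for n = 20, x = 15 with a uniform prior: Beta(16, 6)
post = stats.beta(16, 6)

print(post.mean())              # 16/22 ≈ 0.727
print(post.median())            # posterior median
print((16 - 1) / (16 + 6 - 2))  # mode = (α-1)/(α+β-2) = 0.75
print(post.std())               # ≈ 0.093, analogous to a standard error
```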

Interval estimates (credible intervals)
• Can calculate intervals to represent the uncertainty
• Simply take probability intervals from the posterior, referred to as
credible intervals
• A 95% credible interval (a, b) is given by:
0.95 = Pr(a < θ < b | X)
• For example, with n = 20, x = 15 and a uniform prior, the central 95% credible interval is given by:
> qbeta(c(0.025, 0.975), 16, 6)
[1] 0.5283402 0.8871906
• Analogous to confidence intervals, but easier to interpret/explain
• Can calculate one-sided or two-sided intervals, as required

Visual summaries
• Not always necessary to summarise into one or two numbers
• Plot the posterior (bar plot, density curve, box plot, . . . )
• This is often more informative
• Helps to avoid placing too much emphasis on the tails
• For example:
> curve(dbeta(x, 16, 6), from = 0, to = 1)

[Figure: posterior density curve dbeta(x, 16, 6) plotted over 0 ≤ x ≤ 1]

Specific posterior probabilities
• Posterior probabilities of events relevant to the problem, for example:
Pr(μ > 0 | data)
• More generally, can calculate posterior distributions for arbitrary
functions of the parameters, for example:
f( θ/(1−θ) | data )
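Posterior distributions of derived quantities such as the odds θ/(1−θ) are easy to approximate by simulation. A sketch using draws from the Beta(16, 6) posterior from the earlier example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Draws from the posterior Beta(16, 6) (the n = 20, x = 15 example)
theta = rng.beta(16, 6, size=100_000)

# Posterior of a derived quantity: the odds theta / (1 - theta)
odds = theta / (1 - theta)
print(np.median(odds))       # posterior median of the odds

# A specific posterior probability: Pr(theta > 0.5 | data)
print(np.mean(theta > 0.5))
```

This simulation approach is exactly how posteriors are used in practice when derived quantities have no closed form.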

Computation
• Often cannot derive or write down an expression for the posterior
• True for nearly all modern applications of Bayesian analysis!
• Use computational techniques instead (like resampling methods)
• Typically work with simulations (‘samples’) from the posterior (see the lab)
• Most common class of methods: Markov chain Monte Carlo (MCMC)
• The ability to do this, due to advances in computation, has led to a surge in popularity of Bayesian methods
• Topic is too advanced for this subject, but watch out for it in later years

Review of probability
  Interpretations of probability
Bayesian inference: an introduction
  The Bayesian ‘recipe’
  Using the posterior
Bayesian inference: further examples
  Normal
  Binomial
  Other
Prior distributions
Comparing Bayesian & classical inference

• Only consider single-parameter models
• Only consider conjugate priors
• Examples are intentionally similar to those from earlier modules

Normal, single mean, known σ
• Random sample: X1, . . . , Xn ∼ N(θ, σ2), with σ2 known
• For simplicity, summarise the data by: Y = X̄ ∼ N(θ, σ²/n)
• Prior: θ ∼ N(μ0, σ0²)
• Deriving the posterior:
f(θ | y) ∝ f(y | θ)f(θ)
= (2πσ²/n)^(−1/2) exp(−(y − θ)²/(2σ²/n)) × (2πσ0²)^(−1/2) exp(−(θ − μ0)²/(2σ0²))
∝ exp( −(y − θ)²/(2σ²/n) − (θ − μ0)²/(2σ0²) )

• We can simplify this as:
  f(θ | y) ∝ exp( −(θ − μ1)²/(2σ1²) )
by defining,
  μ1 = [ μ0/σ0² + y/(σ²/n) ] / [ 1/σ0² + 1/(σ²/n) ]   and   1/σ1² = 1/σ0² + 1/(σ²/n)
• Recognise this as a normal pdf (so, we immediately know the normalising constant)
• Posterior: θ | y ∼ N(μ1, σ12)

• ‘1/ var’ is called the precision
• Posterior precision is the sum of the prior and data precisions:
1/σ1² = 1/σ0² + 1/(σ²/n)
• Posterior mean is a weighted average of the sample mean, y = x̄, and the prior mean, μ0, weighted by their precisions:
  μ1 = [ (1/σ0²) / (1/σ0² + 1/(σ²/n)) ] μ0 + [ (1/(σ²/n)) / (1/σ0² + 1/(σ²/n)) ] y
• More data ⇒ higher data precision ⇒ more influence on the posterior

• Credible intervals: probability intervals from the posterior (normal)
• For example, a central 95% credible interval for θ looks like:
μ1 ± 1.96σ1

Example (normal, single mean, known σ)
• X ∼ N(θ, σ² = 36²) is the lifetime of a light bulb, in hours.
• Suppose we knew from experience that the lifetime is somewhere between 1200 and 1600 hours. We could summarise this with a prior θ ∼ N(μ0 = 1400, σ0² = 100²), which places 95% probability on that range.
• Test n = 27 light bulbs and get y = x̄ = 1478
• Posterior: θ | y ∼ N(1478, 6.91²)
• 95% credible interval: (1464, 1491)
• If we use a more informative prior: θ ∼ N(μ0 = 1400, σ0² = 10²)
• Posterior: θ | y ∼ N(1453, 5.69²)
• 95% credible interval: (1442, 1464)
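The posterior parameters quoted above follow from the precision-weighting formulas; a short sketch that reproduces them:

```python
# Normal-normal update for the light bulb example
def normal_posterior(mu0, sigma0, ybar, sigma, n):
    prior_prec = 1 / sigma0**2          # prior precision
    data_prec = n / sigma**2            # data precision, 1 / (sigma^2 / n)
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * ybar)
    return post_mean, post_var**0.5

# Diffuse prior N(1400, 100^2)
m1, s1 = normal_posterior(1400, 100, ybar=1478, sigma=36, n=27)
print(round(m1), round(s1, 2))   # ≈ 1478, 6.91

# More informative prior N(1400, 10^2) pulls the mean towards 1400
m2, s2 = normal_posterior(1400, 10, ybar=1478, sigma=36, n=27)
print(round(m2), round(s2, 2))   # ≈ 1453, 5.69

# Central 95% credible interval for the first case: mu1 ± 1.96 sigma1
print(m1 - 1.96 * s1, m1 + 1.96 * s1)
```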

A less informative prior
• Can make the prior progressively less informative by reducing its precision: σ0 → ∞
• In the limit, we get a ‘flat’ prior across the whole real line
• (μ0 disappears from the model)
• Not a valid probability distribution, cannot integrate to 1
• But it works: it gives us a valid posterior,
σ1² = σ²/n   and   μ1 = y = x̄
and the credible intervals are the same as the confidence intervals.
• This type of prior (cannot integrate to 1) is called an improper prior
• If the prior does integrate to 1, it is called a proper prior
• Can think of improper priors as approximations to very
uninformative proper priors
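A quick numerical check of the claim that the flat-prior credible interval matches the classical confidence interval, using the light bulb numbers (a sketch, assuming scipy is available):

```python
from scipy import stats

# Flat (improper) prior: the posterior is N(xbar, sigma^2 / n),
# so the 95% credible interval is xbar ± 1.96 * sigma / sqrt(n),
# identical to the classical 95% confidence interval
n, xbar, sigma = 27, 1478, 36
se = sigma / n**0.5
lo, hi = stats.norm.interval(0.95, loc=xbar, scale=se)
print(round(lo, 1), round(hi, 1))
```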

Binomial (again)
• X∼Bi(n,θ)
• Prior: θ ∼ Beta(α, β)
• Posterior: θ | x ∼ Beta(α + x, β + n − x)
• Posterior mean:
  E(θ | x) = (α + x) / (α + β + n)
           = [(α+β)/(α+β+n)] · [α/(α+β)] + [n/(α+β+n)] · (x/n)
which is a weighted average of the prior mean and the MLE (x/n).

Example (binomial)
• In a survey of n = 351 voters, x = 185 favour a particular candidate
• Use uniform prior
• Posterior: θ | x ∼ Beta(1 + 185, 1 + 351 − 185) = Beta(186, 167)
• 95% credible interval: (0.475, 0.579)
• Posterior probability of a majority:
Pr(θ > 0.5 | x) = 0.84
> 1 - pbeta(0.5, 186, 167)
[1] 0.8444003

• Suppose an initial survey suggested support was only 45%
• Include this knowledge as a prior
• Deem it to be worth the equivalent of a (pseudo) sample size of 20
• Therefore, our prior should satisfy:
  α/(α + β) = 0.45 and α + β = 20
  ⇒ α = 9, β = 11
• Posterior: θ | x ∼ Beta(9 + 185, 11 + 351 − 185) = Beta(194, 177)
• 95% credible interval: (0.472, 0.574)
• Posterior probability of a majority:
Pr(θ > 0.5 | x) = 0.81
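Both versions of this survey analysis can be reproduced with a library beta distribution; a sketch in Python using scipy (complementing the R `pbeta`/`qbeta` calls in the slides):

```python
from scipy import stats

n, x = 351, 185

# Uniform prior, Beta(1, 1): posterior Beta(186, 167)
post1 = stats.beta(1 + x, 1 + n - x)
print(post1.ppf([0.025, 0.975]))   # central 95% credible interval
print(1 - post1.cdf(0.5))          # Pr(theta > 0.5 | x) ≈ 0.844

# Prior worth 20 pseudo-observations with mean 0.45: Beta(9, 11)
post2 = stats.beta(9 + x, 11 + n - x)
print(post2.ppf([0.025, 0.975]))
print(1 - post2.cdf(0.5))          # ≈ 0.81
```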

Challenge problem (exponential distribution)
Random sample: X1, . . . , Xn ∼ Exp(λ)
Find a conjugate prior distribution for λ. What is the resulting posterior mean?

Challenge problem (boundary problem)
Random sample of size n from the shifted exponential distribution, with pdf:
f(x) = e^(−(x−θ))   (x > θ)
Equivalently: Xi ∼ θ + Exp(1)
Use a ‘flat’ improper prior for θ and derive the posterior. Derive a one-sided 95% credible interval.

Review of probability
  Interpretations of probability
Bayesian inference: an introduction
  The Bayesian ‘recipe’
  Using the posterior
Bayesian inference: further examples
  Normal
  Binomial
  Other
Prior distributions
Comparing Bayesian & classical inference

Aspects of prior distributions
• We already defined:
  ◦ Conjugate priors
  ◦ Improper priors
  ◦ Proper priors
• We now cover:
◦ Choosing appropriate priors
◦ Seeking ‘noninformative’ priors
◦ Sensitivity analysis

How do we choose an appropriate prior?
• Considerations:
◦ Existing knowledge (try to encapsulate/quantify)
◦ Plausibility of various values
◦ Ability of data to ‘overwhelm’ the prior
• The prior should be diffuse enough to allow the data, if there is enough of it, to overwhelm it
• Usually the prior will be much less precise than the data (otherwise, why are we bothering to collect data?)
• If the prior and the data are (vastly) in conflict, something has likely gone wrong: go back and check your assumptions
• Since we expect the data to dominate, we don’t need to be overly worried about the exact shape of the prior
• (All of this becomes more delicate in higher dimensions. . . )

‘Noninformative’ priors
• Can we use a prior that has no influence on the posterior?
• Sometimes: e.g. improper prior for θ in N(θ,σ2)
• But usually not
• ‘Noninformative’ depends on the parameterisation
• What is noninformative on one scale can be informative on a different scale for the same parameter!
• Example, for binomial sampling, Bi(n, θ):
  ◦ θ ∼ Beta(1, 1) is uniform for θ
  ◦ θ ∼ Beta(0, 0) is uniform for log(θ/(1 − θ))
  ◦ θ ∼ Beta(1/2, 1/2) is invariant under reparameterisation (“Jeffreys’ prior”)
• So, generally talk about ‘diffuse’ priors rather than ‘noninformative’

Sensitivity analysis
• Not sure about your prior?
• Worried that it might be too influential?
• Try a range of priors and check how much your conclusions change
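A sketch of such a sensitivity analysis for the earlier survey example, sweeping over several Beta priors (the particular priors chosen here are illustrative; scipy is assumed available):

```python
from scipy import stats

# Sensitivity check for the survey data (n = 351, x = 185):
# recompute Pr(theta > 0.5 | x) under a range of Beta priors
n, x = 351, 185
for alpha, beta in [(1, 1), (0.5, 0.5), (2, 2), (5, 5), (9, 11)]:
    post = stats.beta(alpha + x, beta + n - x)
    print((alpha, beta), round(1 - post.cdf(0.5), 3))
# With this much data, the conclusion barely changes across priors
```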
