Bayesian methods
(Module 9)

Statistics (MAST20005) & Elements of Statistics (MAST90058), Semester 2, 2022

1 Review of probability
2 Interpretations of probability
3 Bayesian inference: an introduction
  3.1 The Bayesian 'recipe'
  3.2 Using the posterior
4 Bayesian inference: further examples
  4.1 Normal
  4.2 Binomial
  4.3 Other
5 Prior distributions
6 Comparing Bayesian & classical inference

Aims of this module

• Explain two different ways to use probability for modelling
• Introduce the Bayesian approach to statistical inference
• Review the probability tools required to carry this out
• Show examples of Bayesian inference for simple models
• Discuss how to choose an appropriate prior
• Compare and contrast Bayesian & classical inference

From our last lecture. . .

• Disease testing example
• Tree diagrams

1 Review of probability

Review some probability definitions

• Let A and B be two events
• Often these are in terms of random variables, e.g. A = 'X = 3'
• Joint probability
Pr(A, B) = Pr(A ∩ B) = Pr(A and B both occur)

• Marginal probability

Pr(A) = Pr(A occurs irrespective of B) = Pr(A, B) + Pr(A, B̄)

• Conditional probability

Pr(A | B) = Pr(A occurs given that B occurs) = Pr(A, B) / Pr(B)

Bayes' theorem

Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)

The denominator can be written out using:

Pr(A) = Pr(A, B) + Pr(A, B̄)
      = Pr(A | B) Pr(B) + Pr(A | B̄) Pr(B̄)

Partitions

• Let B1, B2, . . . , Bk be a partition of the sample space
• This 'splits up' the sample space into distinct events
• More precisely, the events cover the whole sample space (B1 ∪ B2 ∪ · · · ∪ Bk = Ω) and are mutually exclusive (Bi ∩ Bj = ∅ when i ≠ j).
• Example: roll a die and let the outcome be X; the events B1 = 'X is even' and B2 = 'X is odd' form a partition.
• The law of total probability relates marginal and conditional probabilities,

Pr(A) = Σi Pr(A, Bi) = Σi Pr(A | Bi) Pr(Bi)    (sum over i = 1, . . . , k)

Bayes' theorem again

Pr(Bi | A) = Pr(A | Bi) Pr(Bi) / Σj Pr(A | Bj) Pr(Bj)

Sometimes write this more compactly as:

Pr(Bi | A) ∝ Pr(A | Bi) Pr(Bi)

Continuous random variables

Analogous definitions in terms of density functions (for rvs X and Y):

• Joint pdf: f(x, y)
• Marginal pdf (law of total probability)

f(x) = ∫ f(x, y) dy = ∫ f(x | y) f(y) dy    (integrals over −∞ < y < ∞)

• Conditional pdf

f(x | y) = f(x | Y = y) = f(x, y) / f(y)

• Bayes' theorem

f(x | y) = f(y | x) f(x) / f(y)
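The disease-testing example recalled above can be used to check these formulas numerically. Here is a minimal sketch in Python (the course uses R); all the numbers are hypothetical, chosen only to illustrate the formulas:

```python
# Two-event partition: B = 'has disease', Bc = 'no disease'; A = 'test positive'.
# All probabilities below are made-up illustrative values.
prevalence = 0.01    # Pr(B)
sensitivity = 0.99   # Pr(A | B)
specificity = 0.95   # Pr(not A | Bc)

# Law of total probability: Pr(A) = Pr(A | B) Pr(B) + Pr(A | Bc) Pr(Bc)
pr_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)
pr_disease_given_positive = sensitivity * prevalence / pr_positive

print(pr_disease_given_positive)  # ≈ 0.167: low, despite an accurate test
```

Even a very accurate test gives a low posterior probability of disease when the prevalence is low, which is exactly what the tree-diagram calculation shows.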
2 Interpretations of probability
How do we use probability?
• Modelling variation (frequentist probability)
• Representing uncertainty (Bayesian probability)

Classical inference only uses frequentist probability. Bayesian inference uses both.
Frequentist probability
• The relative frequency of occurrence in the long run, under hypothetical repetitions of an experiment
• This is what we usually have in mind when devising a statistical model for the data
• Example: X ∼ N(μ, σ2), specifies a model for variation across multiple observations of X
• Known as frequentist probability
• Also known as aleatory, physical or frequency probability
• Needs a well-defined random experiment / repetition mechanism
• The interpretation for one-off events, and those that have already occurred, is problematic (recall the ‘card trick’)
Bayesian probability
• The degree of plausibility, or strength of belief, of a given statement based on existing knowledge and
evidence, expressed as a probability
• Known as Bayesian probability
• Also known as epistemic or evidential probability
• Can be assigned to any statement, even when no random process is involved, and irrespective of whether the event has yet occurred or not
• Example: what is the probability the dinosaurs were wiped out by an asteroid?
• Popularly expressed in terms of betting: if you were forced to make a bet on the outcome, what odds would you accept?
• Probability also has a mathematical definition, in terms of axioms. This is separate to its interpretation as a
model of reality.
• When using mathematical probability, it is not self-evident that the ‘long-run relative frequency’ actually exists and is equal to the underlying probability you start with as part of the axioms; this is something that needs to be proved. It turns out to be true and this fact is known as the Law of Large Numbers.
• Most people only learn about the frequentist notion of probability. However, in practice they often naturally use the Bayesian notion, as the card trick demonstrated. They do so without necessarily knowing about the different notions of probability, which can sometimes lead to confusion.

Why use Bayesian probability?
We do it naturally. Card trick, gambling odds,. . .
Asking the right question. Allows us to directly answer the question of interest
Going beyond true/false. Can be viewed as an extension of formal logic that allows reasoning under uncertainty
3 Bayesian inference: an introduction

3.1 The Bayesian 'recipe'
The elements of Bayesian inference
• Take our existing statistical models and add:
– Parameters & hypotheses are modelled as random variables
• In other words:
– Parameters will have probability distributions
– Hypotheses will have probabilities

• These are Bayesian probabilities
• They quantify and express our uncertainty, both before ('prior') and after ('posterior') seeing any data
• Requires the use of Bayes' theorem
Motivating example
• A coin is either fair or unfair:

θ = Pr(heads) = 0.5 if the coin is fair, or 0.7 if the coin is unfair
• Flip the coin 20 times
• The number of heads is X ∼ Bi(20, θ)
• In light of the data, what can we say about whether the coin is fair?
• What does X tell us about θ?
Posterior distribution
• Goal: calculate Pr(coin is fair | X) = Pr(θ = 0.5 | X)
• More broadly, Pr(parameter or hypothesis | data)
• This is known as the posterior distribution (or just the posterior)
• Quantifies our knowledge in light of the data we observe
• Posterior means ‘coming after’ in Latin
• In Bayesian inference, the posterior distribution summarises all of the information about the parameters of interest
Calculating the posterior
• Use Bayes' theorem,

Pr(θ = 0.5 | X = x) = Pr(X = x | θ = 0.5) Pr(θ = 0.5) / Pr(X = x)

• The denominator is (law of total probability),
Pr(X = x) = Pr(X = x | θ = 0.5) Pr(θ = 0.5)+
Pr(X = x | θ = 0.7) Pr(θ = 0.7)
• We need to specify:
– The likelihood, Pr(X | θ).
– The prior distribution (or just the prior), Pr(θ)
• In our example, the likelihood is a binomial distribution
Specifying the prior

• Also need a prior to get the whole thing off the ground
• Prior means 'before' in Latin
• Specifying an appropriate prior requires some thought (more details later)
• For now, let's assume either outcome is equally plausible,

Pr(fair coin) = Pr(unfair coin) = 0.5
Pr(θ = 0.5) = Pr(θ = 0.7) = 0.5

Putting it together

• This gives,

Pr(θ = 0.5 | X = x) = Pr(X = x | θ = 0.5) / (Pr(X = x | θ = 0.5) + Pr(X = x | θ = 0.7))

• For example,

Pr(θ = 0.5 | X = 15) =
Pr(θ = 0.5 | X = 10) =
Pr(θ = 0.5 | X = 5) =

Example (card experiment)
• Select 5 cards at random (don’t look at them!)
• Sample from these n times with replacement
• Let X be the number of times you see a red card
• Likelihood: X ∼ Bi(n, θ)
• θ ∈ {0, 1/5, 2/5, 3/5, 4/5, 1}
• Use a uniform prior again,

Pr(θ = a) = 1/6 (for all a)

• Calculate posterior,

Pr(θ = a | X = x) = Pr(X = x | θ = a) Pr(θ = a) / Pr(X = x)

• The denominator is always just a sum/integral of the numerator,

Pr(X = x) = Σb Pr(X = x | θ = b) Pr(θ = b)

• For convenience, we often omit it,

Pr(θ = a | X = x) ∝ Pr(X = x | θ = a) Pr(θ = a)

• This gives,

Pr(θ = a | X = x) ∝ (n choose x) a^x (1 − a)^(n−x) × 1/6
                  ∝ a^x (1 − a)^(n−x)

• Only need the terms that refer to the parameter values, a
• Now try it out…
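One way to 'try it out' is to code the discrete posterior directly. The function below is a hypothetical helper (not from the notes) that works for any finite set of θ values, so it also fills in the fair/unfair-coin posteriors left blank earlier; the card-experiment counts n = 10, x = 7 are made up for illustration. A sketch in Python (the course uses R):

```python
from math import comb

def discrete_posterior(x, n, thetas, prior):
    """Posterior over a finite set of theta values, for X ~ Bi(n, theta)."""
    # Numerator of Bayes' theorem for each candidate value of theta
    nums = [comb(n, x) * t**x * (1 - t)**(n - x) * p
            for t, p in zip(thetas, prior)]
    total = sum(nums)  # denominator: the law of total probability
    return [v / total for v in nums]

# Fair/unfair coin: theta in {0.5, 0.7}, uniform prior, n = 20 flips
coin = discrete_posterior(15, 20, [0.5, 0.7], [0.5, 0.5])
print(coin[0])  # Pr(theta = 0.5 | X = 15) ≈ 0.076: 15 heads favours the unfair coin

# Card experiment: theta in {0, 1/5, ..., 1}, uniform prior (n, x hypothetical)
thetas = [i / 5 for i in range(6)]
card = discrete_posterior(7, 10, thetas, [1/6] * 6)
print([round(p, 3) for p in card])  # zero posterior mass on theta = 0 and theta = 1
```

Note that the binomial coefficient cancels in the normalisation, exactly as in the ∝ argument above.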
Example (beta-binomial)
• X ∼ Bi(n, θ)
• Start with a uniform prior again (now a pdf, since θ is continuous),

f(θ) = 1,  0 ≤ θ ≤ 1

• Calculate posterior pdf,

f(θ | X = x) ∝ Pr(X = x | θ) f(θ) ∝ θ^x (1 − θ)^(n−x)

• Calculate the normalising constant by integrating w.r.t. θ,

∫₀¹ θ^x (1 − θ)^(n−x) dθ = · · · = x! (n − x)! / (n + 1)!

• The posterior therefore has pdf,

f(θ | X = x) = ((n + 1)! / (x! (n − x)!)) θ^x (1 − θ)^(n−x),  0 ≤ θ ≤ 1
• This is a beta distribution
Beta distribution
• A distribution over the unit interval, p ∈ [0, 1]
• Two parameters: α, β > 0
• Notation: P ∼ Beta(α, β)
• The pdf is:

f(p) = (Γ(α + β) / (Γ(α) Γ(β))) p^(α−1) (1 − p)^(β−1),  0 ≤ p ≤ 1

• Γ is the gamma function, a generalisation of the factorial function. Note that Γ(n) = (n − 1)!
• Properties:

E(P) = α / (α + β)
mode(P) = (α − 1) / (α + β − 2)  (α, β > 1)
var(P) = αβ / ((α + β)² (α + β + 1))
• Draw some pdfs to get an idea.
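To 'get an idea', the moment formulas can also be cross-checked by brute-force numerical integration of the pdf. A sketch using only the Python standard library, with Beta(16, 6) as an arbitrary test case:

```python
from math import gamma

def beta_pdf(p, a, b):
    # Beta(a, b) density, written directly from the formula above
    return gamma(a + b) / (gamma(a) * gamma(b)) * p**(a - 1) * (1 - p)**(b - 1)

a, b = 16, 6                    # arbitrary parameters for the check
N = 100_000                     # midpoint rule on a fine grid over (0, 1)
xs = [(i + 0.5) / N for i in range(N)]
mean_num = sum(x * beta_pdf(x, a, b) for x in xs) / N
var_num = sum((x - mean_num)**2 * beta_pdf(x, a, b) for x in xs) / N
mode_num = max(xs, key=lambda x: beta_pdf(x, a, b))

mean_formula = a / (a + b)                        # 16/22 ≈ 0.727
mode_formula = (a - 1) / (a + b - 2)              # 15/20 = 0.75
var_formula = a * b / ((a + b)**2 * (a + b + 1))  # ≈ 0.0086
print(mean_num, mode_num, var_num)  # match the formulas above
```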

Inference from the posterior
• For our example, the posterior is:

θ | X = x ∼ Beta(x + 1, n − x + 1)

• Could use the posterior mean as a point estimate:

E(θ | X = x) = (x + 1) / (n + 2)
• More options later. . .
Different priors
• Let's use a beta distribution as our prior, θ ∼ Beta(α, β)
• This gives posterior pdf,

f(θ | X = x) ∝ Pr(X = x | θ) f(θ)
            ∝ θ^x (1 − θ)^(n−x) × θ^(α−1) (1 − θ)^(β−1)
            = θ^(x+α−1) (1 − θ)^(n−x+β−1)
• This is again in the form of a beta distribution!
θ | X = x ∼ Beta(x + α, n − x + β)
Conjugate distributions
• Beta prior + binomial likelihood ⇒ beta posterior
• This a convenient property
• We say that the beta distribution is a conjugate prior for the binomial distribution
• Note: we initially used a uniform prior, which is equivalent to α = β = 1
Pseudodata
• Can think of the prior as being equivalent to unobserved data
• It has the same influence on the posterior as an actual sample with some sample size and particular (pseudo-)observations
• Provides an intuitive interpretation for the prior
• Works particularly well with conjugate priors
• A Beta(1, 1) prior is equivalent to a sample of size 2, with 1 observed success and 1 observed failure.
• A Beta(α, β) prior is equivalent to a sample of size α + β. The parameters α and β are often called pseudocounts.
• Pseudocounts can be non-integer
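The pseudodata reading drops straight out of the conjugate update, since prior pseudocounts and observed counts simply pool. A small sketch (using the survey numbers that appear in the binomial example later in these notes):

```python
def posterior_params(alpha, beta, x, n):
    """Beta(alpha, beta) prior + x successes in n trials -> beta posterior."""
    return alpha + x, beta + n - x

# A Beta(9, 11) prior acts like a pseudo-sample of size 20 (9 successes,
# 11 failures); updating on 185 successes in 351 trials just adds counts:
a, b = posterior_params(9, 11, 185, 351)
print(a, b)  # (194, 177): pseudocounts and real counts pool together
```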
• The likelihood is sometimes called the ‘model’. But we sometimes refer to the whole setup (including the prior) as the ‘model’. In any case, we can at least call it the ‘model for the data’.
• Classical inference only works with a likelihood, but entails other choices about how to do inference (see later for a more detailed discussion of the differences between approaches)
• Parameters are modelled as random variables. This expresses our uncertainty of their value. We don’t actually think of them as being truly random quantities, as many textbooks suggest! We still think of them as representing some fixed underlying true value, but one we can never know for certain.

3.2 Using the posterior
Summarising the posterior
• We've worked out the posterior. . . now what?
• Visualise it
• Summarise it
• Think about what you wanted to learn
• What was your original question?
Point estimates
• Can calculate single-number (point) summaries
• Popular choices:
– Posterior mean, E(θ | X = x)
– Posterior median, median(θ | X = x)
– Posterior mode, mode(θ | X = x)
• Uniform prior ⇒ posterior mode = MLE
• The posterior standard deviation, sd(θ | X = x), gives a measure of uncertainty (analogous to the standard error)
• For example, with n = 20, x = 15 and a uniform prior,

θ | X = 15 ∼ Beta(16, 6)
E(θ | X = 15) = 16/22 = 0.73
sd(θ | X = 15) = √(16 × 6 / (22² × 23)) = 0.093
Interval estimates (credible intervals)
• Can calculate intervals to represent the uncertainty
• Simply take probability intervals from the posterior, referred to as credible intervals
• A 95% credible interval (a, b) is given by:

0.95 = Pr(a < θ < b | X)

• For example, with n = 20, x = 15 and a uniform prior, the central 95% credible interval is given by:

> qbeta(c(0.025, 0.975), 16, 6)
[1] 0.5283402 0.8871906
• Analogous to confidence intervals, but easier to interpret/explain
• Can calculate one-sided or two-sided intervals, as required
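These summaries can be reproduced outside R too; a sketch using SciPy (assumed to be available), for the Beta(16, 6) posterior from the running example:

```python
from scipy.stats import beta

post = beta(16, 6)  # theta | X = 15, with n = 20 and a uniform prior

print(post.mean())               # ≈ 0.727, the posterior mean (16/22)
print(post.std())                # ≈ 0.093, the posterior standard deviation
print(post.ppf([0.025, 0.975]))  # central 95% credible interval, as with qbeta
```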
Visual summaries
• Not always necessary to summarise into one or two numbers
• Plot the posterior (bar plot, density curve, box plot, . . . )
• This is often more informative
• Helps to avoid placing too much emphasis on the tails
• For example:
> curve(dbeta(x, 16, 6), from = 0, to = 1)

[Figure: the Beta(16, 6) posterior pdf, dbeta(x, 16, 6), plotted over the interval (0, 1)]

Specific posterior probabilities

• Posterior probabilities of events relevant to the problem, for example: Pr(μ > 0 | data)
• More generally, can calculate posterior distributions for arbitrary functions of the parameters, for example:

f(θ/(1 − θ) | data)

Computation

• Often cannot derive or write down an expression for the posterior
• True for nearly all modern applications of Bayesian analysis!
• Use computational techniques instead (like resampling methods)
• Typically work with simulations ('samples') from the posterior (see the lab)
• Most common class of methods: Markov chain Monte Carlo (MCMC)
• The ability to do this, due to advances in computation, has led to a surge in popularity of Bayesian methods
• Topic is too advanced for this subject, but watch out for it in later years

4 Bayesian inference: further examples

• Only consider single-parameter models
• Only consider conjugate priors
• Examples are intentionally similar to those from earlier modules
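As a small taste of the simulation approach mentioned under 'Computation': given draws from the posterior, events and functions of θ are handled the same way. A sketch using the Python standard library's beta sampler, with the Beta(16, 6) posterior from the coin example (the event θ > 0.6 and the odds transformation are arbitrary illustrations):

```python
import random

random.seed(1)
draws = [random.betavariate(16, 6) for _ in range(100_000)]  # theta | data

# Posterior probability of an event, here Pr(theta > 0.6 | data)
pr_event = sum(t > 0.6 for t in draws) / len(draws)

# Posterior distribution of a function of theta: the odds theta / (1 - theta)
odds = [t / (1 - t) for t in draws]
print(pr_event, sum(odds) / len(odds))  # the odds have posterior mean 16/5 = 3.2
```

MCMC methods produce (correlated) draws like these for models where direct sampling is impossible.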

4.1 Normal
Normal, single mean, known σ
• Random sample: X1, . . . , Xn ∼ N(θ, σ2), with σ2 known
• For simplicity, summarise the data by: Y = X ̄ ∼ N(θ, σ2/n)
• Prior: θ∼N(μ0,σ02)
• Deriving the posterior:

f(θ | y) ∝ f(y | θ) f(θ)
         = (1/√(2πσ²/n)) exp(−(y − θ)²/(2σ²/n)) × (1/(σ0√(2π))) exp(−(θ − μ0)²/(2σ0²))
         ∝ exp(−(y − θ)²/(2σ²/n) − (θ − μ0)²/(2σ0²))

• We can simplify this as:

f(θ | y) ∝ exp(−(θ − μ1)²/(2σ1²))

by defining,

μ1 = (μ0/σ0² + y/(σ²/n)) / (1/σ0² + 1/(σ²/n))  and  1/σ1² = 1/σ0² + 1/(σ²/n)

• Recognise this as a normal pdf (so, we immediately know the normalising constant)
• Posterior: θ | y ∼ N(μ1, σ1²)
• '1/var' is called the precision
• Posterior precision is the sum of the prior and data precisions:

1/σ1² = 1/σ0² + 1/(σ²/n)

• Posterior mean is a weighted average of the sample mean, y = x̄, and the prior mean, μ0, weighted by their precisions:

μ1 = ((1/σ0²) μ0 + (1/(σ²/n)) y) / (1/σ0² + 1/(σ²/n))

• More data ⇒ higher data precision ⇒ more influence on the posterior
• Credible intervals: probability intervals from the posterior (normal)
• For example, a central 95% credible interval for θ looks like:

μ1 ± 1.96 σ1

Example (normal, single mean, known σ)

• X ∼ N(θ, σ² = 36²) is the lifetime of a light bulb, in hours.
• Suppose we knew from experience that the lifetime is somewhere between 1200 and 1600 hours. We could summarise this with a prior θ ∼ N(μ0 = 1400, σ0² = 100²), which places 95% probability on that range.
• Test n = 27 light bulbs and get y = x̄ = 1478
• Posterior: θ | y ∼ N(1478, 6.91²)
• 95% credible interval: (1464, 1491)
• If we use a more informative prior: θ ∼ N(μ0 = 1400, σ0² = 10²)
• Posterior: θ | y ∼ N(1453, 5.69²)
• 95% credible interval: (1442, 1464)
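The light-bulb numbers can be reproduced in a few lines; a sketch using only the Python standard library (NormalDist supplies the 1.96 quantile):

```python
from statistics import NormalDist

def normal_posterior(y, n, sigma2, mu0, sigma02):
    """Normal mean with known variance: combine prior and data by precision."""
    prec = 1 / sigma02 + n / sigma2            # posterior precision
    mu1 = (mu0 / sigma02 + n * y / sigma2) / prec
    return mu1, (1 / prec) ** 0.5              # posterior mean and sd

mu1, sd1 = normal_posterior(y=1478, n=27, sigma2=36**2, mu0=1400, sigma02=100**2)
z = NormalDist().inv_cdf(0.975)                # ≈ 1.96
print(round(mu1), round(sd1, 2))               # 1478 6.91
print(round(mu1 - z * sd1), round(mu1 + z * sd1))  # 1464 1491
```

Re-running with sigma02=10**2 reproduces the more informative prior's posterior, N(1453, 5.69²).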

A less informative prior
• Can make the prior progressively less informative by reducing its precision: σ0 → ∞
• In the limit, we get a ‘flat’ prior across the whole real line
• (μ0 disappears from the model)
• Not a valid probability distribution, cannot integrate to 1
• But it works: it gives us a valid posterior,
σ1² = σ²/n  and  μ1 = y = x̄
and the credible intervals are the same as the confidence intervals.
• This type of prior (cannot integrate to 1) is called an improper prior
• If the prior does integrate to 1, it is called a proper prior
• Can think of improper priors as approximations to very uninformative proper priors
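The limit is easy to see numerically: as σ0² grows, the posterior mean and variance converge to x̄ and σ²/n (light-bulb numbers again; the sequence of prior variances is arbitrary):

```python
# Flat-prior limit for the normal model with known variance
y, n, sigma2, mu0 = 1478, 27, 36**2, 1400
for sigma02 in (100**2, 10**6, 10**12):      # progressively flatter priors
    prec = 1 / sigma02 + n / sigma2          # posterior precision
    mu1 = (mu0 / sigma02 + n * y / sigma2) / prec
    print(sigma02, round(mu1, 3), round(1 / prec, 3))
# limit: mu1 -> x_bar = 1478 and 1/prec -> sigma2/n = 1296/27 = 48
```

Note that μ0 stops mattering in the limit, matching the comment above that it disappears from the model.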
4.2 Binomial
Binomial (again)
• X ∼ Bi(n, θ)
• Prior: θ ∼ Beta(α, β)
• Posterior: θ | x ∼ Beta(α + x, β + n − x)
• Posterior mean:

E(θ | x) = (α + x) / (α + β + n)
         = ((α + β) / (α + β + n)) (α / (α + β)) + (n / (α + β + n)) (x / n)

which is a weighted average of the prior mean and the MLE (x/n).
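The weighted-average identity can be sanity-checked with arbitrary numbers (the values below are chosen only for the check):

```python
alpha, beta_, n, x = 9, 11, 351, 185  # arbitrary values for the check

post_mean = (alpha + x) / (alpha + beta_ + n)
prior_mean = alpha / (alpha + beta_)
mle = x / n
w = (alpha + beta_) / (alpha + beta_ + n)  # weight given to the prior mean
weighted = w * prior_mean + (1 - w) * mle

print(post_mean, weighted)  # agree (up to floating-point rounding)
```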
Example (binomial)
• In a survey of n = 351 voters, x = 185 favour a particular candidate
• Use uniform prior
• Posterior: θ | x ∼ Beta(1 + 185, 1 + 351 − 185) = Beta(186, 167)
• 95% credible interval: (0.475, 0.579)
• Posterior probability of a majority:

Pr(θ > 0.5 | x) = 0.84

• R code:

> 1 - pbeta(0.5, 186, 167)
[1] 0.8444003
• Suppose an initial survey suggested support was only 45%
• Include this knowledge as a prior
• Deem it to be worth equivalent to a (pseudo) sample size of 20 • Therefore, our prior should satisfy:
= 0.45 α+β =20
⇒α=9, β=11 11

• Posterior: θ | x ∼ Beta(9 + 185, 11 + 351 − 185) = Beta(194, 177)
• 95% credible interval: (0.472, 0.574)
• Posterior probability of a majority:

Pr(θ > 0.5 | x) = 0.81
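Both posterior probabilities can be reproduced with SciPy's beta cdf (assumed available), the Python analogue of the pbeta call above:

```python
from scipy.stats import beta

# Uniform prior -> Beta(186, 167) posterior
p_uniform = 1 - beta.cdf(0.5, 186, 167)

# Informative prior (pseudo-sample of 20 at 45%) -> Beta(194, 177) posterior
p_informative = 1 - beta.cdf(0.5, 194, 177)

print(p_uniform, p_informative)  # ≈ 0.844 and ≈ 0.81
```

The informative prior pulls the posterior slightly towards 45%, so the probability of a majority drops a little.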
Challenge problem (exponential distribution)
Random sample: X1, . . . , Xn ∼ Exp(λ)

Find a conjugate prior distribution for λ. What is the resulting posterior mean?
Challenge problem (boundary problem)
Random sample of size n from the shifted exponential distribution, with pdf:

f(x) = e^(−(x−θ))  (x > θ)

Equivalently: Xi ∼ θ + Exp(1)

Use a 'flat' improper prior for θ and derive the posterior. Derive a one-sided 95% credible interval.
5 Prior distributions
Aspects of prior distributions
• We already defined:
– Conjugate priors
– Improper priors
– Proper priors
• We now cover:
– Choosing appropriate priors
– Seeking 'noninformative' priors
– Sensitivity analysis
How do we choose an appropriate prior?
• Considerations:
– Existing knowledge (try to encapsulate/quantify)
– Plausibility of various values
– Ability of data to 'overwhelm' the prior
• The prior should be diffuse enough to allow the data, if sufficiently informative, to overwhelm it
• Usually the prior will be much less precise than the data (otherwise, why are we bothering to collect data?)
• If the prior and the data are (vastly) in conflict, something has likely gone wrong: go back and check your assumptions
• Since we expect the data to dominate, we don't need to be overly worried about the exact shape of the prior

• (All of this becomes more delicate in higher dimensions. . . )

'Noninformative' priors

• Can we use a prior that has no influence on the posterior?
• Sometimes: e.g. improper prior for θ in N(θ, σ²)
• But usually not
• 'Noninformative' depends on the parameterisation
• What is noninformative on one scale can be informative on a different scale for the same parameter!
• Example, for binomial sampling, Bi(n, θ):
– θ ∼ Beta(1, 1) is uniform for θ
– θ ∼ Beta(0, 0) is uniform for log(θ/(1 − θ))
– θ ∼ Beta(1/2, 1/2) is invariant under reparameterisation ("Jeffreys' prior")
• So, generally talk about 'diffuse' priors rather than 'noninformative'

Sensitivity analysis

• Not sure about your prior?
• Worried that it might be too influential?
• Try a range of different priors
• This is a sensitivity analysis
• Useful to cover a reasonable set of 'extreme' views, plus typical diffuse priors

Sensitivity to the prior

• The (potential) sensitivity to the prior is a key feature of Bayesian inference.
• If the prior is influential, and you don't really believe it, then you have insufficient data.
• Either need more data, or a more reliable prior.
• This is not a 'bug', it is a feature!
• It alerts you to the relative amount of information in your data (or the lack of it)

6 Comparing Bayesian & classical inference
Goals & philosophies
• Shared goals:
– Learning about the population – Estimating parameters
– Making decisions
– Prediction
• Different underlying philosophies: – Use of probability
– Manner of inference
– Interpretation of results

Assumptions
• Bayesian inference needs a prior.
• Sometimes seen as a weakness, but this aspect is usually overplayed or misrepresented as being overly subjective.
• Classical inference (a.k.a. frequentist inference) requires further choices (e.g. which estimator to use), which can be just as arbitrary or ad hoc as a choice of prior.
• Choice of likelihood is also crucial, and involves similar considerations (and problems) to choosing a prior, but people often overlook this.
• Complex models often start blurring the boundary between the two anyway.

Making assumptions explicit
