Bayes’ Rule
Bayesian Statistics: Statistics 4224/5224, Spring 2021
January 12, 2021
Bayes’ rule: A discrete example
You go to the doctor for a COVID-19 rapid test.
Let θ represent your true state: θ = 1 if you are infected with
the virus, and θ = 0 if you are not.
Let y denote the result of the test: y = 1 if the test comes back
positive and y = 0 if the test is negative.
The probability of a correct result for an infected subject is called the sensitivity of the test; we will denote it p here.
The probability of a correct result for a subject not infected with the virus is called the specificity; we denote it q.
Thus
Pr(y = 1|θ = 1) = p and Pr(y = 0|θ = 0) = q.
Do you have the virus? Of course you don’t know. In Bayesian statistics we describe our uncertainty about the world by assigning probabilities to various states. Suppose your prior probability of being infected is π. That is, the marginal probability
Pr(θ = 1) = π .
Let’s suppose you will take two tests; let yi = 1 if the ith test comes back positive and yi = 0 if it is negative, for i = 1, 2.
We assume that your true state does not change between tests, and that the results are stochastically independent.
We can define a joint probability model for the random variables (θ,y1,y2) as follows:
y1, y2|θ ∼ iid Bernoulli(p) if θ = 1
y1, y2|θ ∼ iid Bernoulli(1 − q) if θ = 0

where θ ∼ Bernoulli(π).
(a) Find Pr(y1 = 1).
(b) Find Pr(θ = 1|y1 = 1).
(c) Find Pr(y2 = 1|y1 = 1).
(d) Find Pr(θ = 1|y1 = y2 = 1).

Suppose π = 0.04, p = 0.90, and q = 0.92.
Answers:
(a) The marginal probability Pr(y1 = 1) is
Pr(θ = 1)Pr(y1 = 1|θ = 1)+Pr(θ = 0)Pr(y1 = 1|θ = 0)
and thus
Pr(y1 = 1) = πp + (1 − π)(1 − q) = 0.1128
(b) The posterior probability of infection, given a positive test:
Pr(θ = 1|y1 = 1) = Pr(θ = 1, y1 = 1) / Pr(y1 = 1)
                 = Pr(θ = 1) Pr(y1 = 1|θ = 1) / Pr(y1 = 1)
                 = πp / [πp + (1 − π)(1 − q)] ≈ 0.32
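For readers who want to check the arithmetic, here is a minimal Python sketch of parts (a) and (b); the variable names are ours, and the numbers are the values given above (π = 0.04, p = 0.90, q = 0.92).

```python
# Numeric check of parts (a) and (b); plain Python, no libraries needed.
pi_, p, q = 0.04, 0.90, 0.92           # prior Pr(theta = 1), sensitivity, specificity

# (a) marginal probability of a positive first test
pr_y1 = pi_ * p + (1 - pi_) * (1 - q)  # 0.1128

# (b) posterior probability of infection given y1 = 1, via Bayes' rule
post1 = pi_ * p / pr_y1                # about 0.3191

print(pr_y1, post1)
```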
(c) Perhaps the most straightforward solution goes

Pr(y2 = 1|y1 = 1) = Pr(y1 = y2 = 1) / Pr(y1 = 1)
                  = [πp^2 + (1 − π)(1 − q)^2] / [πp + (1 − π)(1 − q)] ≈ 0.34
An alternative solution uses the posterior probability from part (b), call this π∗, as the prior probability of infection for the second test.
First, π∗ = Pr(θ = 1|y1 = 1) = 0.31915 from part (b); then Pr(y2 = 1|y1 = 1) = π∗p + (1 − π∗)(1 − q) ≈ 0.34.
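A quick check, continuing the Python sketch above, that the two routes to Pr(y2 = 1|y1 = 1) agree (our own illustration, not from the notes):

```python
# Both routes to Pr(y2 = 1 | y1 = 1); values as above.
pi_, p, q = 0.04, 0.90, 0.92
pr_y1 = pi_ * p + (1 - pi_) * (1 - q)

# Route 1: joint probability of two positives divided by Pr(y1 = 1)
route1 = (pi_ * p**2 + (1 - pi_) * (1 - q)**2) / pr_y1

# Route 2: treat the part (b) posterior pi_star as the new prior
pi_star = pi_ * p / pr_y1
route2 = pi_star * p + (1 - pi_star) * (1 - q)

print(route1, route2)   # both about 0.3417
```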
(d) Again, we can solve this by two equivalent approaches. First, apply Bayes’ rule to the complete data (y1 = 1, y2 = 1).
Pr(θ = 1|y1 = y2 = 1) = Pr(θ = 1, y1 = y2 = 1) / Pr(y1 = y2 = 1)
                      = Pr(θ = 1) Pr(y1 = y2 = 1|θ = 1) / Pr(y1 = y2 = 1)
                      = πp^2 / [πp^2 + (1 − π)(1 − q)^2] ≈ 0.84
The second approach uses the posterior probability π∗ = Pr(θ = 1|y1 = 1) = 0.31915 as our prior for the observation y2 = 1; then

Pr(θ = 1|y1 = y2 = 1) = π∗p / [π∗p + (1 − π∗)(1 − q)] ≈ 0.84
Looking at the algebraic expressions, it is by no means obvious
that the two approaches will yield the same answer.
Yet they are both logically correct, so they must!
The second approach illustrates why we call this Bayesian learning: as we collect more data we learn about the world, and update our belief accordingly.
Walking into the doctor’s office, your probability of Covid was 0.04. After a positive test it went up to 0.32, and after a second positive test it went up to 0.84.
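The sequential story is easy to express in code. The sketch below (the function name update_prob is ours, not from the notes) applies Bayes' rule one test at a time and reproduces the sequence 0.04, then about 0.32, then about 0.84.

```python
def update_prob(prior, p, q, result):
    """Posterior Pr(infected) after one rapid-test result (1 = positive, 0 = negative)."""
    like_inf = p if result == 1 else 1 - p        # Pr(result | infected)
    like_not = 1 - q if result == 1 else q        # Pr(result | not infected)
    return prior * like_inf / (prior * like_inf + (1 - prior) * like_not)

belief = 0.04                  # walking into the doctor's office
for result in (1, 1):          # two positive tests
    belief = update_prob(belief, p=0.90, q=0.92, result=result)
    print(round(belief, 4))    # prints 0.3191, then 0.8406
```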
This set of simple (?) probability calculations represents a complete Bayesian statistical inference.
We have all the elements!
We have an unknown quantity of interest, your status θ.
We have our prior knowledge about θ, which we represent as a probability distribution.
We have well-defined, observed data, the rapid test result y.
We have a probability model that specifies the dependence of y on the value of θ.
Terminology
• θ is the parameter,
• the probability distribution p(θ) is the prior,
• the probability distribution p(y|θ) is called the sampling model or likelihood; and finally,
• the conditional distribution of the parameter θ given the observed data y, p(θ|y), is called the posterior.
Bayesian inference
Hoff (2009) describes the process of Bayesian learning as follows.
1. For each numerical value of θ, our prior distribution p(θ) describes our belief that θ represents the true population characteristics.
2. For each θ and y, our sampling model p(y|θ) describes our belief that y would be the outcome of our study if we knew θ to be true.
Once we obtain the data y, the last step is to update our beliefs about θ:
3. For each numerical value of θ, our posterior distribution p(θ|y)
describes our belief that θ is the true value, having observed the dataset y.
The posterior distribution is obtained from the prior distribution and sampling model via Bayes’ rule:

p(θ|y) = p(θ) p(y|θ) / p(y)

Noting that the denominator

p(y) = ∫ p(θ) p(y|θ) dθ

does not depend on θ and can thus be absorbed into the normalizing constant for the posterior density, we write

p(θ|y) ∝ p(θ) p(y|θ) .
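To see the proportionality in action, here is a small grid-approximation sketch (our own illustration, not from the text): evaluate p(θ)p(y|θ) on a grid of θ values and normalize, which is exactly the role the denominator p(y) plays. The prior and data anticipate the binomial example developed below.

```python
import numpy as np
from scipy import stats

# Unnormalized posterior p(theta) * p(y | theta) on a grid:
# Beta(2, 20) prior, y = 0 successes out of n = 20 binomial trials.
theta = np.linspace(0.001, 0.999, 999)
unnorm = stats.beta.pdf(theta, 2, 20) * stats.binom.pmf(0, 20, theta)

# Normalizing: the sum times the grid spacing approximates p(y).
posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# Compare to the exact conjugate answer derived below, Beta(2, 40).
print(np.max(np.abs(posterior - stats.beta.pdf(theta, 2, 40))))  # small
```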
Welcome to Statistics 4224/5224!
Here is some important course information:
Textbooks
• Bayesian Data Analysis, third edition; by Gelman et al
• A First Course in Bayesian Statistical Methods; by Hoff
• Bayesian Statistical Methods; by Reich and Ghosh
The Gelman text has a lot more in it than we need for our course. I find this book overwhelming. Note this book is now free for non-commercial purposes.
http://www.stat.columbia.edu/~gelman/book/
Hoff and Reich & Ghosh both give more streamlined presentations.
Hoff’s book may be free as well, via https://link.springer.com/, though this may require access to Columbia’s network.
We will use Gelman et al as a course road map, and I will try to always use notation and terminology that is consistent with this book.
Homework exercises may be drawn from all three references; however, a complete statement of each problem will be given in the assignment. And I will, if necessary, “translate” the statement of the problem to use notation and terminology consistent with Gelman’s.
Homework
We’ll have six (6) problem sets over the course of the semester, one due every other week.
Homework assignments are nominally due before class on Tuesday; the deadline for submission to Courseworks will be midnight on Wednesday.
Your first assignment is due Tue Jan 26 (Wed Jan 27)!
Shortly after midnight on Wednesday, homework solutions will be posted. No credit is given for homework submitted after that time.
Bayes’ rule: A continuous example
Suppose we are interested in the prevalence of an infectious disease in a small city.
A random sample of 20 individuals will be checked for infection. Interest is in θ, the fraction of infected individuals in the city.
The data y will record the number of people in the sample who are infected.
Sampling model
If the value of θ were known, a reasonable sampling model for y would be
y|θ ∼ Binomial(20, θ) .
Prior distribution
Other studies from various parts of the country indicate that the infection rate in comparable cities ranges from about 0.05 to 0.20, with an average prevalence of 0.10.
We will represent our prior information about θ by θ ∼ Beta(2, 20) .
The prior expected value for θ is 0.09; about two-thirds of the area under the curve occurs between 0.05 and 0.20.
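A quick check of these prior summaries (a sketch using scipy, not part of the original notes):

```python
from scipy import stats

prior = stats.beta(2, 20)                 # the Beta(2, 20) prior for theta
print(prior.mean())                       # 2/22, about 0.09
print(prior.cdf(0.20) - prior.cdf(0.05))  # about 0.66, roughly two-thirds of the area
```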
Posterior distribution
For values of y = 0, 1, ..., n and 0 < θ < 1 we have

p(θ) = [Γ(a + b) / (Γ(a) Γ(b))] θ^(a−1) (1 − θ)^(b−1) ∝ θ^(a−1) (1 − θ)^(b−1)

and

p(y|θ) = [n! / (y! (n − y)!)] θ^y (1 − θ)^(n−y) ∝ θ^y (1 − θ)^(n−y)

and thus

p(θ|y) ∝ p(θ) p(y|θ) ∝ θ^(a+y−1) (1 − θ)^(b+n−y−1)
Thus the following general result:
If θ ∼ Beta(a, b), and y|θ ∼ Binomial(n, θ), then
θ|y ∼ Beta(a + y, b + n − y) .
Suppose we observed y = 0 infections among the n = 20 individuals tested.
Combining this data with our Beta(a = 2,b = 20) prior, we obtain
θ|y = 0 ∼ Beta(2, 40) .
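A short sketch of this conjugate update (our own illustration) using scipy:

```python
from scipy import stats

a, b = 2, 20       # prior Beta(a, b)
n, y = 20, 0       # data: y infections observed among n individuals

posterior = stats.beta(a + y, b + n - y)           # Beta(2, 40)
print(stats.beta(a, b).mean(), posterior.mean())   # prior mean ~0.091, posterior mean ~0.048
print(posterior.interval(0.95))                    # a central 95% posterior interval for theta
```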
The posterior density is further to the left than the prior distribution, and more peaked as well.
[Figure: prior Beta(2, 20) and posterior Beta(2, 40) densities of θ, plotted for θ between 0.0 and 0.5.]
• It is to the left of p(θ) because the observation y = 0 provides evidence of a low value of θ.
• It is more peaked than p(θ) because it combines information from the data and the prior distribution, and thus contains more information than in p(θ) alone.
The posterior expectation satisfies

E(θ|y) = (a + y) / (a + b + n)
       = [n / (a + b + n)] · (y/n) + [(a + b) / (a + b + n)] · [a / (a + b)]
       = [n / (w + n)] ȳ + [w / (w + n)] θ0

where θ0 = a/(a + b) is the prior expectation of θ and w = a + b.
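A one-line numerical check of this weighted-average identity for our example (a = 2, b = 20, n = 20, y = 0); the variable names are ours:

```python
a, b, n, y = 2, 20, 20, 0
w, theta0, ybar = a + b, a / (a + b), y / n

direct   = (a + y) / (a + b + n)                          # posterior mean
weighted = (n / (w + n)) * ybar + (w / (w + n)) * theta0  # weighted-average form
print(direct, weighted)   # both equal 2/42, about 0.0476
```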
The posterior predictive distribution
Often the objective of a statistical analysis is to build a stochastic model that can be used to make predictions of future events or impute missing values.
Let ỹ be the future observation we would like to predict. Assume the observations are independent given the parameters, and that ỹ follows the same model as the observed data.

The posterior predictive distribution of ỹ is
p(ỹ|y) = ∫ p(ỹ, θ|y) dθ = ∫ p(ỹ|θ, y) p(θ|y) dθ = ∫ p(ỹ|θ) p(θ|y) dθ
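One simple way to approximate this integral is by Monte Carlo: draw θ from the posterior, then draw ỹ from the sampling model. The sketch below continues the prevalence example with posterior Beta(2, 40); the choice of 5 future individuals is ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# theta ~ p(theta | y): posterior Beta(2, 40) from the prevalence example
theta_draws = rng.beta(2, 40, size=100_000)

# y_tilde ~ p(y_tilde | theta): number infected among 5 new individuals
y_tilde = rng.binomial(5, theta_draws)

# Relative frequencies approximate the posterior predictive p(y_tilde | y)
for k in range(6):
    print(k, np.mean(y_tilde == k))
```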
Notation
Summary of notation for distributions involving parameters and data:
Reich & Ghosh    Gelman et al    Terminology
π(θ)             p(θ)            Prior distribution
f(y|θ)           p(y|θ)          Likelihood
m(y)             p(y)            Prior predictive distribution
p(θ|y)           p(θ|y)          Posterior
f∗(y∗|y)         p(ỹ|y)          Posterior predictive
Note that Reich and Ghosh use non-boldface θ and y if the parameter or data vector is one-dimensional.
Know exactly how all these objects are related to each other!