Introduction to Bayesian Inference.
Elena Moltchanova
STAT314/461-2021S1
Rev. Thomas Bayes (c. 1701–1761)
An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M.A. and F.R.S. In: Philosophical Transactions of the Royal Society of London 53 (1763), 370–418.
A coffee example:
1. Meet barista.
2. Barista makes coffee.

[Probability tree]
Experienced (5/7): Excellent 0.80, Good 0.15, So-So 0.05
Trainee (2/7): Excellent 0.20, Good 0.50, So-So 0.30

The branch probabilities are conditional probabilities, e.g., Pr(good coffee | trainee barista) = 0.50.
Bayes’ Theorem aka Inverse Probability Formula.
Consider a set of mutually exhaustive and mutually exclusive events A_1, A_2, ..., A_K and an event B. Assume that the probabilities Pr(A_k) and Pr(B | A_k) are known for all k = 1, ..., K. Then, for some j,
\[
\Pr(A_j \mid B) = \frac{\Pr(B \mid A_j)\,\Pr(A_j)}{\sum_{k} \Pr(B \mid A_k)\,\Pr(A_k)}.
\]
Proof.
\[
\sum_{k} \Pr(B \mid A_k)\,\Pr(A_k) = \sum_{k} \Pr(B \,\&\, A_k) = \Pr(B).
\]
Therefore:
\[
\Pr(A_j \mid B) = \frac{\Pr(B \mid A_j)\,\Pr(A_j)}{\Pr(B)}.
\]
Multiplying both sides by Pr(B):
\[
\Pr(A_j \mid B)\,\Pr(B) = \Pr(B \mid A_j)\,\Pr(A_j),
\]
and both sides equal
\[
\Pr(A_j \,\&\, B) = \Pr(B \,\&\, A_j).
\]
Thus, the equality holds.
Alternatively:
\[
\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}.
\]
Back to coffee:
Given that I am drinking an excellent cup of coffee, what is the probability that it was made by the trainee barista?
In other words, let’s think of a set of K = 2 mutually exclusive and mutually exhaustive events: A_1 = experienced, and A_2 = trainee. And the event of interest B = excellent coffee.
[Probability tree as above: Experienced (5/7): Excellent 0.80, Good 0.15, So-So 0.05; Trainee (2/7): Excellent 0.20, Good 0.50, So-So 0.30]
Put numbers into Bayes’ Formula:
We know that Pr(A_1) = 5/7 and Pr(A_2) = 2/7. We also know that Pr(B | A_1) = 0.80 and Pr(B | A_2) = 0.20. We can use Bayes’ Theorem to obtain our quantity of interest:
\[
\Pr(A_1 \mid B) = \frac{\Pr(B \mid A_1)\,\Pr(A_1)}{\Pr(B \mid A_1)\,\Pr(A_1) + \Pr(B \mid A_2)\,\Pr(A_2)}
= \frac{5/7 \times 0.80}{5/7 \times 0.80 + 2/7 \times 0.20} = 10/11 \approx 0.91.
\]
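The same calculation in code; a minimal Python sketch (Python is used here purely for illustration, with the numbers from the tree above):

```python
# Posterior probability that the experienced barista made the coffee,
# given an excellent cup (priors and likelihoods from the tree).
prior = {"experienced": 5/7, "trainee": 2/7}
lik_excellent = {"experienced": 0.80, "trainee": 0.20}

# Denominator of Bayes' formula: Pr(B) = sum_k Pr(B|A_k) Pr(A_k).
pr_b = sum(lik_excellent[k] * prior[k] for k in prior)

posterior = {k: lik_excellent[k] * prior[k] / pr_b for k in prior}
print(posterior["experienced"])  # 10/11, i.e. about 0.909
```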
Using the “tree”:
[Probability tree as above: Experienced (5/7): Excellent 0.80, Good 0.15, So-So 0.05; Trainee (2/7): Excellent 0.20, Good 0.50, So-So 0.30. Reading off the Excellent branches gives the same posterior, 10/11 vs. 1/11.]
Another cup
[Probability tree with the updated priors: Experienced (10/11): Excellent 0.80, Good 0.15, So-So 0.05; Trainee (1/11): Excellent 0.20, Good 0.50, So-So 0.30]
Another cup: so-so
\[
\Pr(\text{Experienced} \mid \text{Excellent}, \text{So-so}) = \frac{10/11 \times 0.05}{10/11 \times 0.05 + 1/11 \times 0.30} = 0.625.
\]
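The cup-by-cup updating is a natural fit for a small function; a minimal Python sketch under the same tree probabilities:

```python
# Sequential Bayesian updating, cup by cup.
prior = {"experienced": 5/7, "trainee": 2/7}
lik = {
    "experienced": {"excellent": 0.80, "good": 0.15, "so-so": 0.05},
    "trainee": {"excellent": 0.20, "good": 0.50, "so-so": 0.30},
}

def update(belief, cup):
    """One application of Bayes' theorem after observing a cup's quality."""
    pr_cup = sum(lik[k][cup] * belief[k] for k in belief)
    return {k: lik[k][cup] * belief[k] / pr_cup for k in belief}

belief = update(prior, "excellent")   # Pr(experienced) becomes 10/11
belief = update(belief, "so-so")      # ... and then 0.625
print(belief["experienced"])
```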
Two cups at once:
The probability that you get an Excellent and a So-so cup from the experienced barista is 0.80 × 0.05 = 0.04. The probability that you get the same from the trainee is 0.20 × 0.30 = 0.06. Applying Bayes’ Theorem we get
\[
\Pr(\text{Experienced} \mid \text{Excellent}, \text{So-so}) = \frac{\Pr(\text{Ex},\text{Ss} \mid \text{Exp})\,\Pr(\text{Exp})}{\Pr(\text{Ex},\text{Ss} \mid \text{Exp})\,\Pr(\text{Exp}) + \Pr(\text{Ex},\text{Ss} \mid \text{Tr})\,\Pr(\text{Tr})}
= \frac{0.04 \times 5/7}{0.04 \times 5/7 + 0.06 \times 2/7} = 0.625,
\]
the same answer as updating cup by cup.
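A quick sketch confirming that the batch update reproduces the sequential answer (same assumed tables as above):

```python
# Batch update: both cups at once also give 0.625.
prior = {"experienced": 5/7, "trainee": 2/7}
lik = {
    "experienced": {"excellent": 0.80, "so-so": 0.05},
    "trainee": {"excellent": 0.20, "so-so": 0.30},
}

# Likelihood of the whole sample (cup qualities independent given the barista).
joint = {k: lik[k]["excellent"] * lik[k]["so-so"] for k in prior}
pr_data = sum(joint[k] * prior[k] for k in prior)
print(joint["experienced"] * prior["experienced"] / pr_data)  # 0.625
```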
Using Bayes’ Formula:
- The object of inference – the probability that the particular barista is at work today – is constantly updated in light of the data.
- This natural learning process happens sequentially – cup by cup (or batch by batch).
- You can easily include other people’s observations into your process.
Refresher on Probability Distributions – 1:
Consider two random variables x and y with the corresponding probability density functions (p.d.f.) f(x) and f(y). Note that we will use f(·) as a generic notation for any p.d.f.
The following properties apply to any p.d.f.:
1. f(x) ≥ 0 for all x.
2. \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \).
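As a numerical illustration of these two properties (a sketch using the standard normal density, chosen only as an example):

```python
import numpy as np

# Standard normal p.d.f.: nonnegative everywhere and integrates to 1.
def f(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
print(f(x).min() >= 0)      # property 1: f(x) >= 0 on the grid
print((f(x) * dx).sum())    # property 2: approximately 1.0
```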
Refresher on Probability Distributions – 2:
The cumulative distribution function (c.d.f.) is defined as
\[
F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(t)\,dt.
\]
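Numerically, the c.d.f. is just the accumulated area under the p.d.f.; a small sketch continuing the standard-normal example:

```python
import numpy as np

# F(0) = Pr(X <= 0) for the standard normal, by integrating the p.d.f.
def f(t):
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

t = np.linspace(-10, 0, 200001)   # -10 stands in for -infinity here
dt = t[1] - t[0]
print((f(t) * dt).sum())          # approximately 0.5
```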
Refresher on Probability Distributions – 3:
The joint distribution of x and y is
\[
f(x, y) = f(x \mid y)\,f(y) = f(y \mid x)\,f(x).
\]
Here, f(y | x) is referred to as the conditional p.d.f. of y given x.
Note that when x and y are independent,
\[
f(x, y) = f(x)\,f(y).
\]
Refresher on Probability Distributions – 4:
The marginal p.d.f. f(x) can be obtained as:
\[
f(x) = \int_{-\infty}^{\infty} f(x, y)\,dy.
\]
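These identities are easy to check on a grid. A sketch with a made-up joint density f(x, y) = x + y on the unit square (chosen only because its marginal is known exactly):

```python
import numpy as np

# Joint p.d.f. f(x, y) = x + y on the unit square (it integrates to 1).
n = 2001
x = np.linspace(0, 1, n)
y = np.linspace(0, 1, n)
dy = y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
joint = X + Y

# Marginal f(x) = integral over y, which here equals x + 1/2 exactly.
marginal_x = (joint * dy).sum(axis=1)
print(marginal_x[1000], x[1000] + 0.5)   # numeric vs. exact at x = 0.5
```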
Refresher on Probability Distributions – 5:
In general, for random variables x_1, x_2, x_3, ..., x_K with the joint p.d.f. f(x_1, x_2, x_3, ..., x_K), the following chain rule applies:
\[
f(x_1, x_2, \dots, x_K) = f(x_1 \mid x_2, \dots, x_K)\, f(x_2 \mid x_3, \dots, x_K) \cdots f(x_K).
\]
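For discrete random variables the chain rule can be verified directly; a sketch with an arbitrary made-up joint p.m.f. on three binary variables:

```python
import numpy as np

# An arbitrary joint p.m.f. for three binary variables x1, x2, x3.
rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()

f3 = p.sum(axis=(0, 1))    # f(x3)
f23 = p.sum(axis=0)        # f(x2, x3)
cond2 = f23 / f3           # f(x2 | x3)
cond1 = p / f23            # f(x1 | x2, x3)

# Chain rule: f(x1, x2, x3) = f(x1 | x2, x3) f(x2 | x3) f(x3).
print(np.allclose(p, cond1 * cond2 * f3))   # True
```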
Bayes formula for distributions – 1:
Let f(x | θ) denote the p.d.f. of data x given parameter θ, and let f(θ) denote the p.d.f. of the parameter θ. Then:
\[
f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{\int_{\Theta} f(x \mid \theta)\,f(\theta)\,d\theta}.
\]
Bayes formula for distributions – 2:
Note, it is easy to check that this holds:
\[
\int_{\Theta} f(x \mid \theta)\,f(\theta)\,d\theta = \int_{\Theta} f(x, \theta)\,d\theta = f(x).
\]
I.e.,
\[
f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{f(x)}.
\]
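When the integral in the denominator has no convenient closed form, it can be approximated on a grid over Θ. A minimal Python sketch for a binomial likelihood with a uniform prior (the data, 7 successes in 10 trials, are made up for illustration):

```python
import numpy as np

# Grid approximation of the posterior for a binomial probability theta.
theta = np.linspace(0, 1, 10001)
d_theta = theta[1] - theta[0]

prior = np.ones_like(theta)       # f(theta): uniform on [0, 1]
lik = theta**7 * (1 - theta)**3   # f(x | theta), up to a constant

# f(theta | x) = f(x | theta) f(theta) / integral f(x | theta) f(theta) d theta
posterior = lik * prior / (lik * prior * d_theta).sum()
print(theta[np.argmax(posterior)])   # posterior mode, about 0.7
```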
Classical vs. Bayesian Inference:
Classical:
- Experiments are infinitely repeatable under the same conditions (hence: “frequentist”).
- The parameter of interest (θ) is fixed and unknown.
- Inference via Maximum Likelihood.
Bayesian:
- Each experiment is unique (i.e., not repeatable).
- The parameter of interest (θ) is treated as a random variable with a distribution.
- Inference via Bayes’ Theorem.
What is a prior distribution?
- The prior expresses our knowledge about the distribution of the parameter before the experiment. It may be based on some general considerations (a binomial probability has to lie between 0 and 1; average human height must lie between 140 and 190 cm) or on previous experiments (the first diagnostic test was positive).
- If no information is available, a so-called vague or uninformative prior can be used. (BUT: what is uninformative?)
- Different statisticians may have different priors. Sensitivity analysis is important.
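One way to carry out such a sensitivity analysis is to re-run the same posterior calculation under several priors and compare the results; a sketch reusing the made-up binomial data from above, with three Beta priors:

```python
import numpy as np

# Posterior sensitivity: same data, three different Beta(a, b) priors.
theta = np.linspace(0, 1, 10001)[1:-1]   # drop the endpoints 0 and 1
d_theta = theta[1] - theta[0]
lik = theta**7 * (1 - theta)**3          # 7 successes in 10 trials, as before

for a, b in [(1, 1), (2, 2), (10, 10)]:  # vague to fairly informative
    prior = theta**(a - 1) * (1 - theta)**(b - 1)
    post = lik * prior
    post /= (post * d_theta).sum()
    mean = (theta * post * d_theta).sum()
    print(f"Beta({a},{b}) prior: posterior mean {mean:.3f}")
```

Here the posterior mean moves from about 0.67 under the flat Beta(1,1) prior to about 0.57 under the Beta(10,10) prior, which is exactly the kind of dependence a sensitivity analysis is meant to expose.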