3. Bayes’ Theorem
3.1 Derivation
A straightforward application of conditioning: using f(x,y) = f(x|y)f(y) = f(y|x)f(x), we obtain Bayes’ theorem (also called Bayes’ rule)
f(x|y) = f(y|x) f(x) / f(y).
• It applies to discrete and continuous random variables, and a mix of the two.
• Deceptively simple, yet its predictions are often counterintuitive. This is a very powerful combination.
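To make the statement concrete, here is a minimal Python sketch (not from the notes; the joint distribution is made up for illustration) that verifies f(x|y) = f(y|x) f(x) / f(y) numerically on a small discrete example:

```python
# A numerical check of Bayes' rule on a small, made-up discrete joint f(x, y).
f_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginals and conditionals derived from the joint.
f_x = {x: sum(p for (xi, _), p in f_xy.items() if xi == x) for x in (0, 1)}
f_y = {y: sum(p for (_, yi), p in f_xy.items() if yi == y) for y in (0, 1)}
f_x_given_y = {(x, y): f_xy[(x, y)] / f_y[y] for (x, y) in f_xy}
f_y_given_x = {(x, y): f_xy[(x, y)] / f_x[x] for (x, y) in f_xy}

# Bayes' rule: f(x|y) = f(y|x) f(x) / f(y) holds for every (x, y).
for (x, y) in f_xy:
    lhs = f_x_given_y[(x, y)]
    rhs = f_y_given_x[(x, y)] * f_x[x] / f_y[y]
    assert abs(lhs - rhs) < 1e-12
```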
3.2 Examples of Bayes’ Theorem
Let’s do some “simple” examples, adapted from the book The Drunkard’s Walk – How Randomness Rules Our Lives by Leonard Mlodinow. (The solutions to these problems are given at the end of this chapter; however, they will be made available only after the class. One of the purposes of this lecture is to demonstrate how counterintuitive probability can be.)
3.2.1 Girls named Lulu
Case 1: A family has two children. What is the probability that both are girls?
Case 2: A family has two children. Given that one of them is a girl, what is the probability that both are girls?
Case 3: A family has two children. Given that one of them is a girl named Lulu, what is the probability that both are girls?
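Before looking up the solutions, one way to check your intuition is simulation. The following Python sketch (not part of the original notes) estimates all three probabilities by Monte Carlo; the probability q that a girl is named Lulu is an assumed, made-up parameter:

```python
# Monte Carlo estimates for the three "two children" cases.
# Assumptions: each child is a girl with probability 1/2, and each girl is
# independently named Lulu with probability q (q = 0.01 is made up).
import random

def simulate(n=1_000_000, q=0.01):
    hits1 = 0
    cond2 = [0, 0]  # [valid trials, trials with two girls]
    cond3 = [0, 0]
    for _ in range(n):
        girls = [random.random() < 0.5 for _ in range(2)]
        lulus = [g and random.random() < q for g in girls]
        two_girls = all(girls)
        hits1 += two_girls          # Case 1: no conditioning
        if any(girls):              # Case 2: at least one girl
            cond2[0] += 1
            cond2[1] += two_girls
        if any(lulus):              # Case 3: at least one girl named Lulu
            cond3[0] += 1
            cond3[1] += two_girls
    print("Case 1:", hits1 / n)
    print("Case 2:", cond2[1] / cond2[0])
    print("Case 3:", cond3[1] / cond3[0])

simulate()
```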
3.2.2 Disease diagnosis
Doctors were asked to estimate the probability that a woman with no symptoms, between 40 and 50 years old, who has a positive mammogram (i.e. the test reports cancer), actually has breast cancer. The doctors were given the following facts:
The facts
• 7% of mammograms give false positives.
• 10% of mammograms give false negatives.
• Actual incidence of breast cancer in her age group is 0.8%.
Results
• Germany: a third of doctors said 90%; the median estimate was 70%.
• USA: 95% of doctors said the probability was approximately 75%.
What do you think? Aren’t you tempted to say 93%?
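The posterior can be computed directly from the three facts above via Bayes’ theorem; here is a minimal Python sketch (not part of the original notes):

```python
# Bayes' theorem applied to the three stated facts.
p_cancer = 0.008             # prior: incidence in this age group
p_pos_given_cancer = 0.90    # 10% false negatives => 90% true positive rate
p_pos_given_healthy = 0.07   # 7% false positives

# f(z): total probability of a positive mammogram.
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(cancer | positive) = {p_cancer_given_pos:.3f}")  # about 0.094
```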
3.2.3 Prosecutor’s fallacy
The facts
• Couple’s first child died of SIDS (Sudden Infant Death Syndrome) at 11 weeks.
• They had another child, who also died of SIDS at 8 weeks.
• They were arrested and accused of smothering their children, even though there was no physical evidence.
Results
• The prosecutor brought in an expert, who made a simple argument: the odds of SIDS are 1 in 8,543. Therefore, for two children: 1 in 73 million.
• The couple was convicted. Do you agree with the conviction?
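As an arithmetic check (a sketch, not part of the original notes), the expert’s figure is indeed the square of the per-child odds; whether squaring is justified hinges on an independence assumption between the two deaths:

```python
# The expert's arithmetic holds only under an independence assumption.
p_sids = 1 / 8543                  # per-child odds, as stated
p_two = p_sids ** 2                # valid only if the two deaths are independent
print(f"1 in {1 / p_two:,.0f}")    # roughly 1 in 73 million, as claimed
# Caveat: siblings share genetic and environmental risk factors, so
# independence is questionable; and P(two SIDS deaths) is not the same
# quantity as P(innocence | two deaths).
```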
3.3 Combining Priors with Observations
Bayes’ theorem:
f(x|z) = f(z|x) f(x) / f(z)
where:
x: unknown quantity of interest (in this class, usually the system state).
z: observation related to the state (usually a sensor measurement).
f(x): prior belief of the state.
f(z|x): observation model: for a given state, what is the probability of observing z?
f(x|z): posterior belief of the state, taking the observation into account.
f(z) = Σ_x f(z|x) f(x), by the total probability theorem (here, for discrete random variables): the probability of the observation, essentially a normalization constant (does not depend on x).
• This is a systematic way of combining prior beliefs with observations.
• Question: If we observe the state x through z, doesn’t that tell us the state directly? No!
– Dimension: dim(z) is usually smaller than dim(x).
– Noise: f(z|x) is not necessarily “sharp.”
– We want to combine prior knowledge / beliefs with the observation (see the sketch below).
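A minimal Python sketch of a single discrete Bayes update (the state space, prior, and observation model below are made-up illustration values, not from the notes):

```python
# A single discrete Bayes update: combine a prior f(x) with an observation
# model f(z|x). States x in {0, 1, 2}; observations z in {0, 1}.
prior = {0: 0.5, 1: 0.3, 2: 0.2}          # f(x)
likelihood = {0: {0: 0.9, 1: 0.1},        # f(z|x=0) for z in {0, 1}
              1: {0: 0.5, 1: 0.5},        # f(z|x=1)
              2: {0: 0.2, 1: 0.8}}        # f(z|x=2)

def bayes_update(prior, likelihood, z):
    # Unnormalized posterior f(z|x) f(x) for each x, then divide by f(z).
    unnorm = {x: likelihood[x][z] * prior[x] for x in prior}
    f_z = sum(unnorm.values())            # f(z), total probability theorem
    return {x: p / f_z for x, p in unnorm.items()}

print(bayes_update(prior, likelihood, z=1))  # posterior f(x|z=1)
```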
Generalization to Multiple Observations
• We have N observations, each of which can be vector valued: z1, . . . , zN .
• Often, we can assume conditional independence:
f(z1,…,zN|x) = f(z1|x)···f(zN|x)
One interpretation: each measurement is a function of the state x corrupted by noise, and the noise is independent. Example:
zi = gi(x, wi), i = 1, . . . , N; f(w1, . . . , wN) = f(w1)···f(wN).
• Then:
f(x|z1, . . . , zN) = f(x) ∏i f(zi|x) / f(z1, . . . , zN),
where f(x|z1, . . . , zN) is the posterior, f(x) the prior, ∏i f(zi|x) the observation likelihood, and f(z1, . . . , zN) the normalization.
• Normalization:
f(z1, . . . , zN) = Σ_{x∈X} f(x) ∏i f(zi|x),
by the total probability theorem (here, for discrete random variables).
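A minimal Python sketch of this fused update under the conditional-independence assumption (illustrative only; the prior and observation model are the made-up values from the previous sketch):

```python
# Fuse N conditionally independent observations in one Bayes update,
# assuming f(z1,...,zN | x) = f(z1|x)...f(zN|x). Equivalently, one can
# apply the single-observation update N times in sequence.
import math

def bayes_fuse(prior, likelihood, observations):
    # prior: dict x -> f(x); likelihood: dict x -> (dict z -> f(z|x)).
    unnorm = {x: prior[x] * math.prod(likelihood[x][z] for z in observations)
              for x in prior}
    f_z = sum(unnorm.values())   # f(z1,...,zN), total probability theorem
    return {x: p / f_z for x, p in unnorm.items()}

prior = {0: 0.5, 1: 0.3, 2: 0.2}
likelihood = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}, 2: {0: 0.2, 1: 0.8}}
print(bayes_fuse(prior, likelihood, [1, 1]))  # posterior f(x|z1=1, z2=1)
```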
Example
• Let x ∈ {0, 1} represent the truthful answer to a question, with 0 corresponding to a “No”, and 1 to a “Yes”. We ask two people the same question, and then we estimate what the truth is. Let zi be the answer given by person i (an observation), modeled as follows:
zi = x + wi, i = 1, 2; where w1 and w2 are independent, and wi = 0 corresponds to the truth, wi = 1 to a lie;
and the ‘+’ operator is defined by (addition modulo 2):
0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, 1 + 1 = 0.
• For the prior, let f(x) = 1/2 for x = 0, 1. This represents the state of maximum ignorance, where we take all states to be a priori equally likely. We model the truthfulness of person i’s answer as:
wi: fwi(0) = pi, fwi(1) = 1 − pi.
That is, person i tells the truth with probability pi.
• By Bayes’ rule:
f(x|z1, z2) = f(x) f(z1|x) f(z2|x) / f(z1, z2).
We build tables for f(x) f(z1|x) f(z2|x) and f(z1, z2):
x  z1  z2  f(x) f(z1|x) f(z2|x)
0  0   0   0.5 p1 p2
0  0   1   0.5 p1 (1−p2)
0  1   0   0.5 (1−p1) p2
0  1   1   0.5 (1−p1)(1−p2)
1  0   0   0.5 (1−p1)(1−p2)
1  0   1   0.5 (1−p1) p2
1  1   0   0.5 p1 (1−p2)
1  1   1   0.5 p1 p2

z1  z2  f(z1, z2)
0   0   0.5 (p1 p2 + (1−p1)(1−p2))
0   1   0.5 (p1(1−p2) + (1−p1) p2)
1   0   0.5 ((1−p1) p2 + p1(1−p2))
1   1   0.5 ((1−p1)(1−p2) + p1 p2)

• Then
fx|z1z2(0|0, 0) = p1 p2 / (p1 p2 + (1−p1)(1−p2)),  fx|z1z2(1|0, 0) = (1−p1)(1−p2) / (p1 p2 + (1−p1)(1−p2)),
etc. are the probabilities of the true answer being “No” when both persons say “No,” the true answer being “Yes” when both persons say “No,” etc.
• Special case: p1 = 1/2. Our intuition is that the answer z1 is not useful for determining the truth, and this turns out to be correct:
– fx|z1z2(0|0, 0) = p2
– fx|z1z2(0|1, 0) = p2, etc.
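As a numerical check of this example (a sketch, not part of the original notes; the values p1 = 0.8, p2 = 0.9 are made up), the posterior formulas and the p1 = 1/2 special case can be verified directly:

```python
# The two-person example, computed numerically. Person i tells the truth
# with probability p_i, and z_i = x + w_i with '+' being addition mod 2.
def posterior(z1, z2, p1, p2):
    prior = (0.5, 0.5)                 # f(x) = 1/2 for x = 0, 1
    def lik(z, x, p):                  # f(z_i|x): truth if z == x, else a lie
        return p if z == x else 1 - p
    unnorm = [prior[x] * lik(z1, x, p1) * lik(z2, x, p2) for x in (0, 1)]
    f_z = sum(unnorm)                  # f(z1, z2), total probability theorem
    return [u / f_z for u in unnorm]

# Both say "No": matches p1 p2 / (p1 p2 + (1-p1)(1-p2)) for f(0|0,0).
print(posterior(0, 0, p1=0.8, p2=0.9))   # [0.973..., 0.027...]
# Special case p1 = 1/2: z1 carries no information, and f(0|0,0) = p2.
print(posterior(0, 0, p1=0.5, p2=0.9))   # [0.9, 0.1]
```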