3. Bayes’ Theorem
3.1 Derivation
A straightforward application of conditioning: using f(x,y) = f(x|y)f(y) = f(y|x)f(x), we obtain Bayes’ theorem (also called Bayes’ rule)
$$ f(x|y) = \frac{f(y|x)\, f(x)}{f(y)}. $$
• It applies to discrete and continuous random variables, and a mix of the two.
• Deceptively simple, and the predictions are counterintuitive. This is a very powerful
combination.
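The derivation can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint PMF values are made up, and the helper names (`cond_x_given_y`, `cond_y_given_x`, `bayes`) are hypothetical. It computes f(x|y) directly from the joint and compares it with the right-hand side of Bayes' theorem.

```python
from fractions import Fraction as F

# Hypothetical joint PMF f(x, y) for x in {0, 1}, y in {0, 1, 2} (illustrative numbers only).
joint = {
    (0, 0): F(2, 20), (0, 1): F(5, 20), (0, 2): F(1, 20),
    (1, 0): F(6, 20), (1, 1): F(2, 20), (1, 2): F(4, 20),
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals f(x) and f(y), obtained by summing the joint over the other variable.
f_x = {x: sum(joint[x, y] for y in ys) for x in xs}
f_y = {y: sum(joint[x, y] for x in xs) for y in ys}

def cond_x_given_y(x, y):
    """f(x|y) = f(x,y) / f(y), straight from the definition of conditioning."""
    return joint[x, y] / f_y[y]

def cond_y_given_x(y, x):
    """f(y|x) = f(x,y) / f(x)."""
    return joint[x, y] / f_x[x]

def bayes(x, y):
    """Right-hand side of Bayes' theorem: f(y|x) f(x) / f(y)."""
    return cond_y_given_x(y, x) * f_x[x] / f_y[y]

# Both routes to f(x|y) agree exactly for every (x, y) pair.
bayes_matches = all(bayes(x, y) == cond_x_given_y(x, y) for x in xs for y in ys)
```

Exact fractions are used so that the comparison is an equality rather than a floating-point approximation.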
3.2 Examples of Bayes’ Theorem
Let’s do some “simple” examples, adapted from the book The Drunkard’s Walk – How Randomness Rules Our Lives by Leonard Mlodinow. (The solutions to these problems are given at the end of this chapter; however they will be made available only after the class. One of the purposes of this lecture is to demonstrate how counterintuitive probability can be.)
3.2.1 Girls named Lulu
Case 1: A family has two children. What is the probability that both are girls?
Last update: 2020-01-23 at 09:30:37
Case 2: A family has two children. Given that one of them is a girl, what is the probability that both are girls?
Case 3: A family has two children. Given that one of them is a girl named Lulu, what is the probability that both are girls?
3.2.2 Disease diagnosis
Doctors were asked to estimate the probability that a woman with no symptoms, between 40 and 50 years old, who has a positive mammogram (i.e. the test reports cancer), actually has breast cancer. The doctors were given the following facts:
The facts
• 7% of mammograms give false positives.
• 10% of mammograms give false negatives.
• Actual incidence of breast cancer in her age group is 0.8%.
Results
• Germany: A third of doctors said 90%, the median estimate was 70%.
• USA: 95% said the probability was approximately 75%.
What do you think? Aren’t you tempted to say 93%?
3.2.3 Prosecutor’s fallacy
The facts
• Couple’s first child died of SIDS (Sudden Infant Death Syndrome) at 11 weeks.
• They had another child, who also died of SIDS at 8 weeks.
• They were arrested and accused of smothering their children, even though there was no physical evidence.
Results
• The prosecutor brought in an expert, who made a simple argument: the odds of SIDS are 1 in 8,543. Therefore, for two children: 1 in 73 million.
• The couple was convicted. Do you agree with the conviction?
3.3 Combining Priors with Observations
Bayes’ theorem:
$$ f(x|z) = \frac{f(z|x)\, f(x)}{f(z)} $$
Wherein:
x: unknown quantity of interest (in this class, usually the system state).
z: observation related to state (usually a sensor measurement).
f(x): prior belief of state.
f(z|x): observation model: for a given state, what is the probability of observing z?
f(x|z): posterior belief of state, taking observation into account.
f(z) = $\sum_x f(z|x)\, f(x)$, by the total probability theorem (here, for discrete random variables): probability of observation, essentially a normalization constant (does not depend on x).
• This is a systematic way of combining prior beliefs with observations.
• Question: If we observe the state x through z, doesn’t that tell us the state directly?
No!
– Dimension: dim(z) is usually smaller than dim(x).
– Noise: f(z|x) is not necessarily “sharp.”
– Want to combine prior knowledge / beliefs.
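For a discrete state, the update from prior to posterior can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `bayes_update`, the door states, and the sensor probabilities are all assumptions made up for this example.

```python
def bayes_update(prior, likelihood):
    """Combine a prior f(x) with an observation model f(z|x) for one observed z.

    prior:      dict mapping each state x to f(x)
    likelihood: dict mapping each state x to f(z|x), evaluated at the observed z
    Returns a dict mapping each state x to the posterior f(x|z).
    """
    unnormalized = {x: likelihood[x] * prior[x] for x in prior}
    f_z = sum(unnormalized.values())  # f(z) = sum_x f(z|x) f(x), total probability
    if f_z == 0:
        raise ValueError("observation has zero probability under the prior")
    return {x: u / f_z for x, u in unnormalized.items()}

# Hypothetical example: a door is "open" or "closed", and a noisy sensor reports "open".
prior = {"open": 0.5, "closed": 0.5}
likelihood_z_open = {"open": 0.6, "closed": 0.3}  # assumed f(z = "open" | x)
posterior = bayes_update(prior, likelihood_z_open)
```

Note that the sensor is not “sharp”: even after the measurement, the posterior for "open" is only 2/3, which is why the prior still matters.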
Generalization to Multiple Observations
• We have N observations, each of which can be vector valued: z1, . . . , zN .
• Often, we can assume conditional independence:
f(z1,…,zN|x) = f(z1|x)···f(zN|x)
One interpretation: each measurement is a function of the state x corrupted by noise, and the noise is independent. Example: $z_i = g_i(x, w_i)$, with $f(w_1, \ldots, w_N) = f(w_1) \cdots f(w_N)$.
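• Then:
$$ \underbrace{f(x|z_1,\ldots,z_N)}_{\text{posterior}} = \frac{\overbrace{f(x)}^{\text{prior}} \; \overbrace{\prod_i f(z_i|x)}^{\text{observation likelihood}}}{\underbrace{f(z_1,\ldots,z_N)}_{\text{normalization}}} $$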
• Normalization:
$$ f(z_1,\ldots,z_N) = \sum_{x \in X} f(x) \prod_i f(z_i|x) $$
by the total probability theorem (here, for discrete random variables).
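Under the conditional independence assumption, the posterior for a discrete state is the prior times the product of the individual likelihoods, normalized over all states. The sketch below illustrates this; the function name `posterior_multi` and the numerical values are assumptions chosen for the example. It also checks that processing the observations one at a time (using each posterior as the next prior) gives the same result as processing them all at once.

```python
from math import prod

def posterior_multi(prior, likelihoods):
    """Posterior f(x | z_1, ..., z_N) for conditionally independent observations.

    prior:       dict mapping each state x to f(x)
    likelihoods: list of dicts; the i-th maps each state x to f(z_i | x)
                 evaluated at the observed z_i
    """
    unnorm = {x: prior[x] * prod(lik[x] for lik in likelihoods) for x in prior}
    norm = sum(unnorm.values())  # f(z_1, ..., z_N), by total probability
    return {x: u / norm for x, u in unnorm.items()}

# Hypothetical numbers: two states, two observations.
prior = {0: 0.5, 1: 0.5}
liks = [{0: 0.8, 1: 0.3}, {0: 0.6, 1: 0.1}]

batch = posterior_multi(prior, liks)

# Sequential processing: the posterior after z_1 becomes the prior for z_2.
sequential = prior
for lik in liks:
    sequential = posterior_multi(sequential, [lik])
```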
Example
• Let x ∈ {0, 1} represent the truthful answer to a question, with 0 corresponding to a “No”, and 1 to a “Yes”. We ask two people the same question, and then we estimate what the truth is. Let zi be the answer given by person i (an observation), modeled as follows:
$$ z_i = x + w_i, \quad i = 1, 2, $$
where w1 and w2 are independent, wi = 0 means that person i tells the truth, wi = 1 means that person i lies, and the ‘+’-operator is defined by
0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, 1 + 1 = 0.
• For the prior, let f(x) = 1/2 for x = 0, 1. This represents the state of maximum ignorance, where we take all states to be a priori equally likely. We model the truthfulness of person i’s answer as:
$$ f_{w_i}(0) = p_i, \quad f_{w_i}(1) = 1 - p_i. $$
That is, person i tells the truth with probability pi.
• By Bayes’ rule:
$$ f(x|z_1, z_2) = \frac{f(x)\, f(z_1|x)\, f(z_2|x)}{f(z_1, z_2)}. $$
We build tables for f(x) f(z1|x) f(z2|x) and f(z1, z2):

x  z1  z2   f(x) f(z1|x) f(z2|x)
0  0   0    0.5 p1 p2
0  0   1    0.5 p1 (1 − p2)
0  1   0    0.5 (1 − p1) p2
0  1   1    0.5 (1 − p1)(1 − p2)
1  0   0    0.5 (1 − p1)(1 − p2)
1  0   1    0.5 (1 − p1) p2
1  1   0    0.5 p1 (1 − p2)
1  1   1    0.5 p1 p2

z1  z2   f(z1, z2)
0   0    0.5 (p1 p2 + (1 − p1)(1 − p2))
0   1    0.5 (p1 (1 − p2) + (1 − p1) p2)
1   0    0.5 ((1 − p1) p2 + p1 (1 − p2))
1   1    0.5 ((1 − p1)(1 − p2) + p1 p2)

• Then
$$ f_{x|z_1 z_2}(0|0,0) = \frac{p_1 p_2}{p_1 p_2 + (1-p_1)(1-p_2)}, \quad f_{x|z_1 z_2}(1|0,0) = \frac{(1-p_1)(1-p_2)}{p_1 p_2 + (1-p_1)(1-p_2)}, $$
etc. are the probabilities of the true answer being “No” when both persons say “No,” the true answer being “Yes” when both persons say “No,” etc.
• Special case: p1 = 1/2. Our intuition is that the answer z1 is not useful for determining the truth, and this turns out to be correct:
– $f_{x|z_1 z_2}(0|0,0) = p_2$
– $f_{x|z_1 z_2}(0|1,0) = p_2$, etc.
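The example can be reproduced numerically. The following is a minimal sketch under the model stated above (uniform prior, independent w1 and w2, and ‘+’ acting as addition modulo 2); the function name `posterior_truth` and the values chosen for p1 and p2 are assumptions for illustration.

```python
def posterior_truth(z1, z2, p1, p2):
    """Return f(x | z1, z2) for x in {0, 1}, with uniform prior f(x) = 1/2.

    Person i tells the truth with probability p_i, so
    f(z_i | x) = p_i if z_i == x, and 1 - p_i otherwise.
    """
    def lik(z, x, p):
        # z = x + w (mod 2): the answer equals the truth exactly when w = 0.
        return p if z == x else 1 - p

    unnorm = {x: 0.5 * lik(z1, x, p1) * lik(z2, x, p2) for x in (0, 1)}
    norm = sum(unnorm.values())  # f(z1, z2)
    return {x: u / norm for x, u in unnorm.items()}

# Both persons fairly reliable, and both say "No".
both_no = posterior_truth(0, 0, p1=0.9, p2=0.8)

# Special case p1 = 1/2: person 1 carries no information, the posterior depends on z2 only.
uninformative_a = posterior_truth(0, 0, p1=0.5, p2=0.8)
uninformative_b = posterior_truth(1, 0, p1=0.5, p2=0.8)
```

With p1 = 0.9 and p2 = 0.8, two “No” answers give a posterior of 0.72 / 0.74 ≈ 0.973 for the truth being “No”, which is higher than either person's reliability alone.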
2.4 Solutions to Bayes’ Theorem Examples of Section ??
2.4.1 Girls named Lulu
Case 1: A family has two children. What is the probability that both are girls? Hopefully you said 1/4. (Assuming that the probability of a child being a boy or a girl is 1/2.)
Case 2: A family has two children. Given that one of them is a girl, what is the probability that both are girls? The answer is 1/3, which can be seen as follows:
• Define:
x = 1: no boys in the family; x = 0: boy in the family
y = 1: no girls in the family; y = 0: girl in the family
• Then $f_x(1) = \tfrac{1}{4}$, $f_x(0) = \tfrac{3}{4}$, $f_y(1) = \tfrac{1}{4}$, $f_y(0) = \tfrac{3}{4}$.
• Therefore
$$ f_{x|y}(1|0) = \frac{f_{y|x}(0|1)\, f_x(1)}{f_y(0)} = \frac{1 \cdot \frac{1}{4}}{\frac{3}{4}} = \frac{1}{3}. $$
Alternative approach:
• Generally, there are four possible gender combinations for the two children, which are all equally likely: (B,B), (B,G), (G,B), (G,G).
• Knowing that one of the children is a girl, only three options remain: (B,G), (G,B), (G,G). Therefore, the probability of (G,G) is 1 out of 3.
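The counting argument can be written out as a short enumeration. This is a sketch assuming, as above, that the four gender combinations are equally likely; the variable names are chosen for this illustration only.

```python
from fractions import Fraction
from itertools import product

# All gender combinations for (first child, second child), assumed equally likely.
families = list(product("BG", repeat=2))  # (B,B), (B,G), (G,B), (G,G)

# Case 1: no conditioning.
p_case1 = Fraction(sum(f == ("G", "G") for f in families), len(families))

# Case 2: condition on "at least one girl", then count the two-girl families.
at_least_one_girl = [f for f in families if "G" in f]
both_girls = [f for f in at_least_one_girl if f == ("G", "G")]
p_case2 = Fraction(len(both_girls), len(at_least_one_girl))
```

Here `p_case1` evaluates to 1/4 and `p_case2` to 1/3: conditioning removes only the (B,B) family from the sample space.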
Case 3: A family has two children. Given that one of them is a girl named Lulu, what is the probability that both are girls?
Here, the key is that Lulu is an unusual name for a girl.
• Define:
x = 1: no boys in the family; x = 0: boy in the family
y = 1: no girl named Lulu in the family; y = 0: girl named Lulu in the family
• We want to know:
$$ f_{x|y}(1|0) = \frac{f_{y|x}(0|1)\, f_x(1)}{f_y(0)} $$
• What is $f_{y|x}(0|1)$? Given that they are both girls, what is the probability that there is a girl named Lulu in the family? Let p be the probability that someone names a girl Lulu, where we assume that p ≪ 1 (do you know anyone named Lulu?). Then consider the random variables c1 and c2, where c1 refers to the first child, c2 to the second child:
ci = 0: not named Lulu; ci = 1: named Lulu

c1  c2   f_{c1,c2|x}(c1, c2|1)
0   0    (1 − p)(1 − p)
0   1    (1 − p)p
1   0    p
1   1    0

Therefore $f_{y|x}(0|1) = 2p - p^2 \approx 2p$ since p ≪ 1.
• How about $f_y(0)$, the probability that there is a girl named Lulu in the family? Proceed similarly to above:
ci = 0: boy; ci = 1: girl, not named Lulu; ci = 2: girl, named Lulu

c1  c2   f(c1, c2)
0   0    0.25
0   1    0.25(1 − p)
0   2    0.25p
1   0    0.25(1 − p)
1   1    0.25(1 − p)^2
1   2    0.25p(1 − p)
2   0    0.25p
2   1    0.25p
2   2    0

(Sanity check (sum): $0.25(1 + 1 - p + p + 1 - p + 1 - 2p + p^2 + p - p^2 + 2p) = 0.25(4) = 1$.)
Therefore, $f_y(0) = 0.75p + 0.25p - 0.25p^2 \approx p$.
• From the above, we get
$$ f_{x|y}(1|0) = \frac{f_{y|x}(0|1)\, f_x(1)}{f_y(0)} = \frac{(2p - p^2) \cdot \frac{1}{4}}{p - \frac{1}{4}p^2} \approx \frac{2p \cdot \frac{1}{4}}{p} = \frac{1}{2} \quad \text{for } p \ll 1. $$
Even though you can follow all the steps, the end result probably still seems wrong.
An alternate (frequentist) way to calculate this: assume 1 in 1000 girls is named Lulu. Of 100,000 families with two children, 75,000 will have at least one girl: 50,000 will have a girl and a boy, and 25,000 will have two girls. Of the 50,000 girl/boy families, we expect 50 to have a girl named Lulu. Of the 25,000 girl/girl families, we expect 50 to have a girl named Lulu: 25 where the first-born is Lulu, 25 where the second born is Lulu. The probability is thus 1/2.
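The result can also be evaluated exactly for a specific p, without the p ≪ 1 approximation. The sketch below uses the model of this section (a family does not give the name Lulu to both children); the function name `lulu_posterior` is hypothetical, and p = 1/1000 is taken from the frequentist argument.

```python
from fractions import Fraction

def lulu_posterior(p):
    """P(both girls | there is a girl named Lulu), for naming probability p."""
    # Given two girls: the first is Lulu (p), or the first is not and the second is ((1-p)p).
    f_y0_given_x1 = p + (1 - p) * p  # = 2p - p^2
    f_x1 = Fraction(1, 4)            # prior probability of two girls
    # A Lulu appears in: (boy, Lulu), (girl not Lulu, Lulu), (Lulu, boy), (Lulu, other girl).
    f_y0 = Fraction(1, 4) * (p + p * (1 - p) + p + p)  # = p - p^2 / 4
    return f_y0_given_x1 * f_x1 / f_y0

post = lulu_posterior(Fraction(1, 1000))
```

Algebraically the expression simplifies to (2 − p) / (4 − p), which gives 1999/3999 ≈ 0.4999 for p = 1/1000 and tends to 1/2 as p goes to zero.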
Note: there’s a lot of subtlety at work here, and the above answers aren’t unambiguous (specifically, how was the given information obtained?). Alternative interpretations of the problem yield different answers, see R. Falk, “When truisms clash: Coping with a counterintuitive problem concerning the notorious two-child family” in Thinking & Reasoning, 2011, 147 (4), pp. 353 – 366.
2.4.2 Disease diagnosis
Define the following random variables:
x = 1: patient does not have cancer; x = 0: patient has cancer
y = 1: test provides a negative result; y = 0: test provides a positive result
• We want to know:
$$ f_{x|y}(0|0) = \frac{f_{y|x}(0|0)\, f_x(0)}{f_y(0)}. $$
• We calculate
$$ f_{y|x}(0|0) = 0.9 \ \text{(false negative rate = 10\%)}, \quad f_x(0) = 0.008, $$
and, using the total probability theorem $f(y) = \sum_x f(y|x)\, f(x)$,
$$ f_y(0) = f_{y|x}(0|0)\, f_x(0) + f_{y|x}(0|1)\, f_x(1) = 0.90 \cdot 0.008 + 0.07 \cdot 0.992. $$
• Therefore, we get
$$ f_{x|y}(0|0) = \frac{0.9 \cdot 0.008}{0.9 \cdot 0.008 + 0.07 \cdot 0.992} \approx 0.094. $$
The probability is 9.4%! Reason: most positive results are due to false positives.
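The same calculation as a short sketch, using the three facts given to the doctors; the function and parameter names are chosen for this illustration.

```python
def posterior_cancer(prevalence, false_neg_rate, false_pos_rate):
    """P(cancer | positive test), by Bayes' theorem with total probability."""
    true_pos = (1 - false_neg_rate) * prevalence   # f(y=0|x=0) f(x=0)
    false_pos = false_pos_rate * (1 - prevalence)  # f(y=0|x=1) f(x=1)
    return true_pos / (true_pos + false_pos)

p = posterior_cancer(prevalence=0.008, false_neg_rate=0.10, false_pos_rate=0.07)
```

In frequency terms: of 100,000 women, 800 have cancer and 720 of them test positive, while about 6,944 of the 99,200 healthy women also test positive, so only 720 of roughly 7,664 positive results are true positives.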
2.4.3 Prosecutor’s fallacy
Problems with the conviction:
• The events are not independent. A more detailed study showed that the chances of two cases of SIDS are 1 in 2.75 million; however, these are still small odds.
• Inversion: the probability that two children die of SIDS is actually of no interest. What we want to know:
(1) Given that two children have died, what is the probability that they died of SIDS?
(2) Given that two children have died, what is the probability that they were murdered?
These were calculated, and (1) is nine times more likely than (2). The conviction was eventually overturned.
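The two numbers in this example can be put side by side in a short sketch. It rests on an assumption not stated in the notes: that SIDS and murder are the only two explanations being compared, so that their posterior probabilities sum to one. The variable names are illustrative.

```python
# The expert's figure: squaring the single-case odds wrongly assumes independence.
single_case_odds = 8543
naive_double_odds = single_case_odds ** 2  # 72,982,849, i.e. "1 in 73 million"

# Posterior comparison: (1) is nine times more likely than (2).
# ASSUMPTION: SIDS and murder are treated as the only two hypotheses.
odds_sids_vs_murder = 9
p_sids_given_deaths = odds_sids_vs_murder / (odds_sids_vs_murder + 1)
p_murder_given_deaths = 1 - p_sids_given_deaths
```

Under that assumption the probability of SIDS given the two deaths is 0.9, even though the probability of two SIDS deaths in a randomly chosen family is tiny; the two quantities answer different questions.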