COMP2610 / COMP6261 – Information Theory – Lecture 3: Probability Theory and Bayes’ Rule
Robert C. Williamson
Research School of Computer Science
July 30, 2018
1 / 34
Last time
A general communication system
Why do we need probability?
Basics of probability theory
Joint, marginal and conditional distributions
Review Exercise
Suppose I go through the records for N = 1000 students, checking their
admission status, A ∈ {0, 1}, and whether they are “brilliant” or not,
B ∈ {0, 1}.
(Aside: “Brilliance” is a dodgy concept, and does not predict scientific achievement as well as persistence and combinatorial ability; see
e.g. Dean Simonton, Scientific Genius: A Psychology of Science, Cambridge University Press, 2009; this is just a toy example!)
Say that the counts for admission and brilliance are
        B = 0   B = 1
A = 0    680      10
A = 1    220      90
Then:
p(A = 1, B = 0) = 220/1000   (joint)
p(B = 1) = 100/1000          (marginal)
p(A = 0) = 690/1000          (marginal)
p(B = 1 | A = 1) = 90/310    (conditional)
p(A = 0 | B = 0) = 680/900   (conditional)
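The five answers in the review exercise can be reproduced mechanically from the counts table. The following is a minimal sketch, not part of the original slides; the function names are illustrative, and exact fractions are used so the results match the slide values without rounding.

```python
from fractions import Fraction

# counts[a][b] = number of students with A = a and B = b (from the exercise).
counts = {0: {0: 680, 1: 10},
          1: {0: 220, 1: 90}}
N = sum(sum(row.values()) for row in counts.values())  # grand total, 1000


def joint(a, b):
    """p(A = a, B = b): one cell divided by the grand total."""
    return Fraction(counts[a][b], N)


def marginal_A(a):
    """p(A = a): sum the joint over all values of B (sum rule)."""
    return sum(joint(a, b) for b in (0, 1))


def marginal_B(b):
    """p(B = b): sum the joint over all values of A (sum rule)."""
    return sum(joint(a, b) for a in (0, 1))


def cond_B_given_A(b, a):
    """p(B = b | A = a) = p(A = a, B = b) / p(A = a)."""
    return joint(a, b) / marginal_A(a)


def cond_A_given_B(a, b):
    """p(A = a | B = b) = p(A = a, B = b) / p(B = b)."""
    return joint(a, b) / marginal_B(b)


print(joint(1, 0), marginal_B(1), marginal_A(0))
print(cond_B_given_A(1, 1), cond_A_given_B(0, 0))
```

Note that `Fraction` reduces automatically, so 90/310 is shown as 9/31 and 680/900 as 34/45.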
This time
More on joint, marginal and conditional distributions
When can we say that X, Y do not influence each other?
What, if anything, does p(X = x | Y = y) tell us about
p(Y = y | X = x)?
Philosophically related to “How do we know / learn about the world?”
I am not providing a general answer; but keep it in mind!
Outline
1 More on Joint, Marginal and Conditional Distributions
2 Statistical Independence
3 Bayes’ Theorem
4 Wrapping up
Document Modelling Example
Suppose we have a large document of English text, represented as a
sequence of characters:
x_1 x_2 x_3 ... x_N
e.g. hello how are you
Treat each consecutive pair of characters as the outcome of “random
variables” X, Y, i.e.
X = ‘h’, Y = ‘e’
X = ‘e’, Y = ‘l’
X = ‘l’, Y = ‘l’
…
Document Modelling: Marginal and Joint Distributions
 i  a_i  p_i        i  a_i  p_i        i  a_i  p_i
 1   a   0.0575    10   j   0.0006    19   s   0.0567
 2   b   0.0128    11   k   0.0084    20   t   0.0706
 3   c   0.0263    12   l   0.0335    21   u   0.0334
 4   d   0.0285    13   m   0.0235    22   v   0.0069
 5   e   0.0913    14   n   0.0596    23   w   0.0119
 6   f   0.0173    15   o   0.0689    24   x   0.0073
 7   g   0.0133    16   p   0.0192    25   y   0.0164
 8   h   0.0313    17   q   0.0008    26   z   0.0007
 9   i   0.0599    18   r   0.0508    27   -   0.1928
(The 27th symbol, ‘-’, denotes the space character.)
[Figure: left panel, unigram (monogram) distribution p(x); right panel, bigram distribution p(x, y), with x and y ranging over the 27 symbols]
Marginal and joint distributions for English alphabet, estimated from the “FAQ
manual for Linux”. Figure from MacKay (ITILA, 2003); areas of squares proportional to probability (the right way to do it!).
Document Modelling: Conditional Distributions
[Figure: two panels, (a) P(y | x) and (b) P(x | y), with x and y ranging over the alphabet]
Conditional distributions for English alphabet, estimated from the “FAQ manual for
Linux”. Figure from MacKay (ITILA, 2003).
Are these distributions “symmetric”?
Is P(X = x | Y = y) = P(Y = y | X = x)? Is P(X = x | Y = y) = P(X = y | Y = x)?
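To make the document-modelling example concrete, here is a rough sketch that estimates the joint, marginal and conditional distributions from consecutive character pairs. It uses the short string from the earlier slide rather than the Linux FAQ, so the numbers are only illustrative; the function names are assumptions, not anything from MacKay's text.

```python
from collections import Counter


def bigram_model(text):
    """Estimate p(x, y), p(x) and p(y) from consecutive character pairs."""
    pairs = list(zip(text, text[1:]))           # (x, y) = (x_n, x_{n+1})
    total = len(pairs)
    joint = {xy: c / total for xy, c in Counter(pairs).items()}
    p_x = {x: c / total for x, c in Counter(x for x, _ in pairs).items()}
    p_y = {y: c / total for y, c in Counter(y for _, y in pairs).items()}
    return joint, p_x, p_y


def cond_y_given_x(joint, p_x, y, x):
    """p(Y = y | X = x) = p(x, y) / p(x)."""
    return joint.get((x, y), 0.0) / p_x[x]


def cond_x_given_y(joint, p_y, x, y):
    """p(X = x | Y = y) = p(x, y) / p(y)."""
    return joint.get((x, y), 0.0) / p_y[y]


text = "hello how are you"
joint, p_x, p_y = bigram_model(text)

# 'l' is followed by 'o' in half of its occurrences as a first character,
# but only one of the three pairs ending in 'o' starts with 'l'.
print(cond_y_given_x(joint, p_x, "o", "l"))
print(cond_x_given_y(joint, p_y, "l", "o"))
```

Even on this tiny string the two conditionals differ, which is the asymmetry the slide asks about.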
Recap: Sum and Product Rules
Sum rule:
$$p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$$
Product rule:
$$p(X = x_i, Y = y_j) = p(Y = y_j \mid X = x_i)\, p(X = x_i)$$
Relating the Marginal, Conditional and Joint
Suppose we knew p(X = x, Y = y) for all values of x, y. Could we
compute all of p(X = x | Y = y), p(X = x) and p(Y = y)? Yes.
Now suppose we knew p(X = x) and p(Y = y) for all values of x, y.
Could we compute p(X = x, Y = y) or p(X = x | Y = y)? No.
The difference in answers above is of great significance
        B = 0   B = 1
A = 0    680      10
A = 1    220      90

        B = 0   B = 1
A = 0    640      50
A = 1    260      50
These have the same marginals, but different joint distributions
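The claim about the two count tables is easy to check numerically. A small sketch (the helper names are illustrative) that computes the row and column marginals of each table and compares them:

```python
table_1 = [[680, 10], [220, 90]]   # rows: A = 0, 1; columns: B = 0, 1
table_2 = [[640, 50], [260, 50]]


def marginals(table):
    """Return (p(A), p(B)) as lists, obtained by summing rows and columns."""
    n = sum(map(sum, table))
    p_a = [sum(row) / n for row in table]
    p_b = [sum(col) / n for col in zip(*table)]
    return p_a, p_b


def joint(table):
    """Return the joint distribution p(A, B) as a nested list."""
    n = sum(map(sum, table))
    return [[c / n for c in row] for row in table]


print(marginals(table_1))
print(marginals(table_2))
print(joint(table_1) == joint(table_2))
```

Both tables give the same pair of marginals, while the joint tables differ cell by cell, so the marginals alone cannot pin down the joint.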
Joint as the “Master” Distribution
In general, there can be many consistent joint distributions for a given set
of marginal distributions
The joint distribution is the “master” source of information about the
dependence
Recall: Fruit-Box Experiment
Statistical Independence
Suppose that both boxes (red and blue) contain the same proportion of
apples and oranges.
If fruit is selected uniformly at random from each box:
p(F = a|B = r) = p(F = a|B = b) (= p(F = a))
p(F = o|B = r) = p(F = o|B = b) (= p(F = o))
The probability of selecting an apple (or an orange) is independent of the
box that is chosen.
We may study the properties of F and B separately: this often simplifies
analysis
Statistical Independence: Definition
Definition: Independent Variables
Two variables X and Y are statistically independent, denoted X ⊥⊥ Y , if
and only if their joint distribution factorizes into the product of their
marginals:
X ⊥⊥ Y ↔ p(X ,Y ) = p(X )p(Y )
This definition generalises to more than two variables.
Are the variables in the language example statistically independent?
A Note on Notation
When we write
p(X ,Y ) = p(X )p(Y )
we have not specified the outcomes of X ,Y explicitly
This statement is a shorthand for
p(X = x ,Y = y) = p(X = x)p(Y = y)
for every possible x and y
This notation is sometimes called implied universality
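Because the shorthand p(X, Y) = p(X)p(Y) stands for one equation per pair (x, y), testing independence means checking every cell of the joint. The sketch below does that with exact fractions; `is_independent` is an illustrative helper, the first joint comes from the admission counts earlier in the lecture, and the second is a hypothetical joint built as the product of the same marginals.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """joint maps (x, y) -> probability.

    Returns True iff p(x, y) == p(x) * p(y) for every x and every y.
    """
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y]
               for x, y in product(xs, ys))


# Joint distribution from the admission/brilliance counts.
admission = {(0, 0): Fraction(680, 1000), (0, 1): Fraction(10, 1000),
             (1, 0): Fraction(220, 1000), (1, 1): Fraction(90, 1000)}

# Hypothetical joint: the product of the same marginals.
p_a = {0: Fraction(69, 100), 1: Fraction(31, 100)}
p_b = {0: Fraction(9, 10), 1: Fraction(1, 10)}
factorised = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

print(is_independent(admission), is_independent(factorised))
```

With floating-point probabilities the equality test would need a tolerance; exact fractions avoid that here.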
Conditional independence
We may also consider random variables that are conditionally independent
given some other variable
Definition: Conditionally Independent Variables
Two variables X and Y are conditionally independent given Z , denoted
X ⊥⊥ Y |Z , if and only if
p(X ,Y |Z ) = p(X |Z )p(Y |Z )
Intuitively, Z is a common cause for X and Y
Example: X = whether I have a cold
Y = whether I have a headache
Z = whether I have the flu
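Conditional independence does not imply ordinary independence. The sketch below builds a joint over binary X, Y, Z as p(z) p(x | z) p(y | z), so X ⊥⊥ Y | Z holds by construction, and then checks both properties. All of the probabilities are invented for illustration; none come from the slides.

```python
from fractions import Fraction as F
from itertools import product

# Made-up numbers, for illustration only.
p_z = {0: F(9, 10), 1: F(1, 10)}           # p(Z = z)
p_x1_given_z = {0: F(1, 10), 1: F(1, 2)}   # p(X = 1 | Z = z)
p_y1_given_z = {0: F(1, 5), 1: F(4, 5)}    # p(Y = 1 | Z = z)


def bern(p_one, v):
    """Probability that a binary variable with p(1) = p_one takes value v."""
    return p_one if v == 1 else 1 - p_one


# Joint built so that X and Y are conditionally independent given Z.
joint = {(x, y, z): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_z[z], y)
         for x, y, z in product((0, 1), repeat=3)}


def prob(x=None, y=None, z=None):
    """Marginal probability of the specified values; None means 'summed out'."""
    want = (x, y, z)
    return sum(p for key, p in joint.items()
               if all(w is None or w == k for w, k in zip(want, key)))


# p(x, y | z) == p(x | z) p(y | z) for every x, y, z?
cond_indep = all(
    prob(x, y, z) / prob(z=z)
    == (prob(x=x, z=z) / prob(z=z)) * (prob(y=y, z=z) / prob(z=z))
    for x, y, z in product((0, 1), repeat=3))

# p(x, y) == p(x) p(y) for every x, y?
marg_indep = all(prob(x=x, y=y) == prob(x=x) * prob(y=y)
                 for x, y in product((0, 1), repeat=2))

print(cond_indep, marg_indep)
```

Under these assumed numbers X and Y are dependent once Z is summed out, even though they are independent for each fixed value of Z.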
Revisiting the Product Rule
The product rule tells us:
p(X ,Y ) = p(Y |X )p(X )
This can equivalently be interpreted as a definition of conditional
probability:
$$p(Y \mid X) = \frac{p(X, Y)}{p(X)}$$
Can we use these to relate p(X |Y ) and p(Y |X )?
Posterior Inference:
Example 1 (MacKay, 2003)
Dicksy Sick had a test for a rare disease
  - Only 1% of people of Dicksy’s background have the disease
The test simply classifies a person as having the disease, or not
The test is reliable, but not infallible
  - It correctly identifies a sick individual 95% of the time:
    p(identifies sick | sick) = 95%.
  - It correctly identifies a healthy individual 96% of the time:
    p(identifies healthy | healthy) = 96%.
Dicksy has tested positive (apparently sick)
What is the probability of Dicksy having the disease?
Posterior Inference:
Example 1: Formalization
Let D ∈ {0, 1} denote whether Dicksy has the disease, and T ∈ {0, 1} the
outcome of the test:
p(D = 1) = 0.01 p(D = 0) = 0.99
p(T = 1|D = 1) = 0.95 p(T = 1|D = 0) = 0.04
p(T = 0|D = 1) = 0.05 p(T = 0|D = 0) = 0.96
We need to compute p(D = 1 | T = 1), the probability of Dicksy having
the disease given that the test came back positive.
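Posterior Inference:
Example 1: Solution
$$
\begin{aligned}
p(D = 1 \mid T = 1) &= \frac{p(D = 1, T = 1)}{p(T = 1)} && \text{(definition of conditional probability)} \\
&= \frac{p(T = 1, D = 1)}{p(T = 1)} && \text{(symmetry)} \\
&= \frac{p(T = 1 \mid D = 1)\, p(D = 1)}{p(T = 1)} && \text{(product rule)} \\
&= \frac{p(T = 1 \mid D = 1)\, p(D = 1)}{\sum_d p(T = 1 \mid D = d)\, p(D = d)} && \text{(sum rule)} \\
&= \frac{p(T = 1 \mid D = 1)\, p(D = 1)}{p(T = 1 \mid D = 1)\, p(D = 1) + p(T = 1 \mid D = 0)\, p(D = 0)} \\
&\approx 0.19.
\end{aligned}
$$
Despite testing positive and the high accuracy of the test, the probability
of Dicksy having the disease is only about 0.19!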
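Plugging the numbers from the formalization into the last line of the derivation can be done in a few lines. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Values from the formalization of Example 1.
p_D = {1: 0.01, 0: 0.99}              # prior p(D = d)
p_T1_given_D = {1: 0.95, 0: 0.04}     # p(T = 1 | D = d)

# Sum rule: p(T = 1) = sum_d p(T = 1 | D = d) p(D = d)
evidence = sum(p_T1_given_D[d] * p_D[d] for d in (0, 1))

# Bayes' rule: p(D = 1 | T = 1)
posterior = p_T1_given_D[1] * p_D[1] / evidence

print(evidence, posterior)
```

The numerator is 0.95 × 0.01 = 0.0095 and the denominator is 0.0095 + 0.04 × 0.99 = 0.0491, giving the value of roughly 0.19 quoted on the slide.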
Why is the Probability So Low?
A “Natural Frequency” Approach
In 100 people, only 1 is expected to have the disease (p(D = 1) = 0.01)
This sick person will most likely test positive (p(T = 1|D = 1) = 0.95)
But around 4 healthy people are expected to be wrongly flagged as sick
(p(T = 1|D = 0) = 0.04, and 0.04× 99 ≈ 4)
So when the test is positive, the chance of being sick is ≈ 1/5
(Aside: If you can correctly perform the calculation on the previous slide, you are doing better than
most medical doctors! See Gerd Gigerenzer and Adrian Edwards, Simple tools for understanding risks:
from innumeracy to insight, British Medical Journal, 327(7417), 741–744, 27 September 2003; Gerd
Gigerenzer, Reckoning with risk: Learning to live with uncertainty, Penguin, 2002.
Moral of the story: if you get sick, don’t delegate conditional probability computations to your doctor!)
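The natural-frequency argument can also be written as plain counting. The sketch below uses a hypothetical cohort of 10,000 people (rather than 100) so that every expected count is a whole number; the cohort size is an assumption chosen for convenience.

```python
from fractions import Fraction

population = 10_000                       # hypothetical cohort size
sick = population * 1 // 100              # 1% prevalence
healthy = population - sick
true_positives = sick * 95 // 100         # 95% of sick people test positive
false_positives = healthy * 4 // 100      # 4% of healthy people test positive

# Among everyone who tests positive, what share is actually sick?
share_sick = Fraction(true_positives, true_positives + false_positives)

print(sick, healthy, true_positives, false_positives)
print(share_sick, float(share_sick))
```

Here 95 of the 491 positive tests belong to sick people, which is the same ratio as in the Bayes' rule calculation and close to the 1/5 estimate on the slide.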
Bayes’ Theorem
We have implicitly used the following (at first glance remarkable) fact:
Bayes’ Theorem:
$$p(Z \mid X) = \frac{p(Z, X)}{p(X)} = \frac{p(X, Z)}{p(X)} = \frac{p(X \mid Z)\, p(Z)}{p(X)} = \frac{p(X \mid Z)\, p(Z)}{\sum_{Z'} p(X \mid Z')\, p(Z')}$$
If we can express what knowledge of X (test) tells us about Z (disease),
then we can express what knowledge of Z tells us about X
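The final form of the theorem translates directly into a small function for discrete variables. The following is one possible sketch, with an assumed dictionary layout for the prior and likelihood; it is applied to the disease test for a negative result, a case the slides do not work through.

```python
def posterior(prior, likelihood, x):
    """Return p(Z = z | X = x) for every z.

    prior[z]         = p(Z = z)
    likelihood[z][x] = p(X = x | Z = z)
    """
    unnormalised = {z: likelihood[z][x] * prior[z] for z in prior}
    evidence = sum(unnormalised.values())      # p(X = x), by the sum rule
    return {z: u / evidence for z, u in unnormalised.items()}


# Numbers from the disease-test example; the key names are illustrative.
prior = {"sick": 0.01, "healthy": 0.99}
likelihood = {"sick": {"pos": 0.95, "neg": 0.05},
              "healthy": {"pos": 0.04, "neg": 0.96}}

after_negative = posterior(prior, likelihood, "neg")
after_positive = posterior(prior, likelihood, "pos")
print(after_negative["sick"], after_positive["sick"])
```

The denominator is computed once and shared by all values of Z, which is why the returned posterior always sums to one.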
The Bayesian Inference Framework
Bayesian Inference
Bayesian inference provides a mathematical framework explaining how to
change our (prior) beliefs in the light of new evidence.
$$\underbrace{p(Z \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid Z)}^{\text{likelihood}} \times \overbrace{p(Z)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}}$$
Prior: Belief that someone is sick
Likelihood: Probability of testing positive given you are sick
Posterior: Probability of being sick given you test positive
Posterior Inference:
Example 2 (Bishop, 2006)
Recall our fruit-box example:
The proportions of oranges and apples in each box are given by
[figure not recovered]
Someone told us that in a previous experiment they ended up picking
up the red box 40% of the time and the blue box 60% of the time.
A piece of fruit has been picked up and it turned out to be an orange.
What is the probability that it came from the red box?
Posterior Inference:
Example 2: Formalization
Let B ∈ {r, b} denote the selected box and F ∈ {a, o} the selected fruit.
p(B = r) = 4/10            p(B = b) = 6/10
p(F = a | B = r) = 1/4     p(F = o | B = r) = 3/4
p(F = a | B = b) = 3/4     p(F = o | B = b) = 1/4
We need to compute p(B = r | F = o), the probability that a picked up
orange came from the red box.
Posterior Inference:
Example 2: Solution
We simply use Bayes’ rule:
$$
\begin{aligned}
p(B = r \mid F = o) &= \frac{p(F = o \mid B = r)\, p(B = r)}{p(F = o)} \\
&= \frac{p(F = o \mid B = r)\, p(B = r)}{p(F = o \mid B = r)\, p(B = r) + p(F = o \mid B = b)\, p(B = b)} \\
&= \frac{2}{3}
\end{aligned}
$$
and therefore p(B = b | F = o) = 1/3.
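The value 2/3 can be confirmed with exact arithmetic. A brief sketch using the numbers from the formalization (variable names are illustrative):

```python
from fractions import Fraction as F

p_box = {"r": F(4, 10), "b": F(6, 10)}           # prior p(B)
p_orange_given_box = {"r": F(3, 4), "b": F(1, 4)}  # p(F = o | B)

# Sum rule: p(F = o)
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes' rule: p(B = r | F = o)
p_red_given_orange = p_orange_given_box["r"] * p_box["r"] / p_orange

print(p_orange, p_red_given_orange, 1 - p_red_given_orange)
```

The numerator is 3/4 × 4/10 = 3/10 and the evidence is 3/10 + 1/4 × 6/10 = 9/20, so the ratio is exactly 2/3.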
Posterior Inference:
Example 2: Interpretation of the Solution
If we hadn’t been told anything about the fruit picked, the blue
box would be more likely to have been selected than the red box
  - A priori we have p(B = r) = 4/10 and p(B = b) = 6/10
Once we learn that an orange has been picked, the probability that
the selected box is the red one increases
  - Because the red box contains a higher proportion of oranges than
    the blue box (3/4 versus 1/4)
In fact, the proportion of oranges is so much higher in the red box that
this is strong evidence that the orange came from it
  - So after picking up the orange, the red box is twice as likely to
    have been selected as the blue one (2/3 versus 1/3)
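One way to quantify how strong the evidence is, offered here as a supplementary sketch rather than something taken from the slides, is to write Bayes' rule in odds form: the posterior odds equal the likelihood ratio times the prior odds. With the numbers of Example 2:

```latex
\[
\frac{p(B = r \mid F = o)}{p(B = b \mid F = o)}
= \frac{p(F = o \mid B = r)}{p(F = o \mid B = b)} \cdot \frac{p(B = r)}{p(B = b)}
= \frac{3/4}{1/4} \cdot \frac{4/10}{6/10}
= 3 \cdot \frac{2}{3}
= 2
\]
```

An orange is three times as probable under the red box as under the blue box, which is enough to turn prior odds of 2:3 against red into posterior odds of 2:1 in its favour, consistent with p(B = r | F = o) = 2/3.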
Summary
Recap on joint, marginal and conditional distributions
Interpretation of conditional probability
Statistical Independence
Bayes’ rule: combining a prior and a likelihood to obtain a posterior
Reading: MacKay § 2.1, § 2.2 and § 2.3
Homework Exercise
Suppose we know that random variables X, Y satisfy
\[
p(X \mid Y) = p(Y \mid X)
\]
What can you conclude about the relationship between X and Y?
If X and Y are independent, does that imply p(X | Y) = p(Y | X)?
Repeat the above questions for the statement
\[
\frac{p(X \mid Y)}{p(Y \mid X)} = \frac{p(X)}{p(Y)}
\]
Next time
More examples on Bayes’ theorem:
  - Eating hamburgers
  - Detecting terrorists
  - The Monty Hall problem
  - Document modelling
Are there notions of probability beyond frequency counting?