
COMP2610 / COMP6261 – Information Theory
Lecture 3: Probability Theory and Bayes’ Rule

Robert C. Williamson

Research School of Computer Science

July 30, 2018

1 / 34

Last time

A general communication system

Why do we need probability?

Basics of probability theory

Joint, marginal and conditional distributions

2 / 34

Review Exercise

Suppose I go through the records for N = 1000 students, checking their
admission status A ∈ {0, 1}, and whether they are “brilliant” or not,
B ∈ {0, 1}

(Aside: “Brilliance” is a dodgy concept, and does not predict scientific achievement as well as persistence and combinatorial ability; see
e.g. Dean Simonton, Scientific Genius: A Psychology of Science, Cambridge University Press, 2009; this is just a toy example!)

Say that the counts for admission and brilliance are

B = 0 B = 1
A = 0 680 10
A = 1 220 90

Then:

p(A = 1,B = 0) 220/1000 Joint
p(B = 1) 100/1000 Marginal
p(A = 0) 690/1000 Marginal
p(B = 1|A = 1) 90/310 Conditional
p(A = 0|B = 0) 680/900 Conditional

3 / 34
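As a quick sanity check, the probabilities above can be recomputed from the count table in a few lines of Python. This is only an illustrative sketch (not part of the original slides), using the counts given above.

```python
# Sketch: joint, marginal and conditional probabilities from the
# admission (A) / brilliance (B) counts in the review exercise.
counts = {(0, 0): 680, (0, 1): 10,   # (A, B): count
          (1, 0): 220, (1, 1): 90}
N = sum(counts.values())             # 1000 students

p_joint = {ab: c / N for ab, c in counts.items()}                 # p(A=a, B=b)
p_A = {a: sum(p_joint[(a, b)] for b in (0, 1)) for a in (0, 1)}   # marginal p(A=a)
p_B = {b: sum(p_joint[(a, b)] for a in (0, 1)) for b in (0, 1)}   # marginal p(B=b)

print(p_joint[(1, 0)])               # p(A=1, B=0)    = 0.22
print(p_B[1], p_A[0])                # p(B=1), p(A=0) = 0.10, 0.69
print(p_joint[(1, 1)] / p_A[1])      # p(B=1 | A=1)   = 90/310
print(p_joint[(0, 0)] / p_B[0])      # p(A=0 | B=0)   = 680/900
```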

This time

More on joint, marginal and conditional distributions

When can we say that X ,Y do not influence each other?

What, if anything, does p(X = x |Y = y) tell us about
p(Y = y |X = x)?

Philosophically related to “How do we know / learn about the world?”
I am not providing a general answer; but keep it in mind!

4 / 34

Outline

1 More on Joint, Marginal and Conditional Distributions

2 Statistical Independence

3 Bayes’ Theorem

4 Wrapping up

5 / 34

1 More on Joint, Marginal and Conditional Distributions

2 Statistical Independence

3 Bayes’ Theorem

4 Wrapping up

6 / 34

Document Modelling Example

Suppose we have a large document of English text, represented as a
sequence of characters:

x1x2x3 . . . xN

e.g. hello how are you

Treat each consecutive pair of characters as the outcome of “random
variables” X ,Y , i.e.

X = ‘h’, Y = ‘e’

X = ‘e’, Y = ‘l’

X = ‘l’, Y = ‘l’

7 / 34
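A minimal sketch of how such consecutive pairs could be collected in practice; the string here is just the toy phrase above, not the corpus used for the figures on the next slides.

```python
from collections import Counter

text = "hello how are you"                 # toy document
pairs = list(zip(text, text[1:]))          # consecutive character pairs (x, y)
pair_counts = Counter(pairs)               # counts of each (X, Y) outcome

print(pairs[:3])                 # [('h', 'e'), ('e', 'l'), ('l', 'l')]
print(pair_counts[('l', 'l')])   # how often Y = 'l' follows X = 'l'
```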

Document Modelling: Marginal and Joint Distributions

[Figure: the marginal (unigram/monogram) distribution p(x) over the 27-character alphabet a–z plus space, and the joint (bigram) distribution p(x, y) over consecutive character pairs, estimated from the “FAQ manual for Linux”. Areas of squares are proportional to probability (the right way to do it!). Figure from MacKay (ITILA, 2003).]

8 / 34

Document Modelling: Conditional Distributions

[Figure: (a) the conditional distribution P(y | x); (b) the conditional distribution P(x | y), for consecutive character pairs over the 27-character alphabet, estimated from the “FAQ manual for Linux”. Figure from MacKay (ITILA, 2003).]

Are these distributions “symmetric”? That is, is
P(X = x | Y = y) = P(Y = y | X = x)? Is P(X = x | Y = y) = P(X = y | Y = x)?

9 / 34

Recap: Sum and Product Rules

Sum rule:

p(X = x_i) = ∑_j p(X = x_i, Y = y_j)

Product rule:

p(X = x_i, Y = y_j) = p(Y = y_j | X = x_i) p(X = x_i)

10 / 34
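Both rules are easy to verify numerically, e.g. on the admission/brilliance table from the review exercise; a small sketch:

```python
# Sketch: checking the sum and product rules on the admission/brilliance table.
p = {(0, 0): 0.68, (0, 1): 0.01,   # p(A=a, B=b)
     (1, 0): 0.22, (1, 1): 0.09}

# Sum rule: p(A=1) = sum over b of p(A=1, B=b)
p_A1 = p[(1, 0)] + p[(1, 1)]                 # 0.31

# Product rule: p(A=1, B=1) = p(B=1 | A=1) p(A=1)
p_B1_given_A1 = p[(1, 1)] / p_A1             # 90/310
assert abs(p_B1_given_A1 * p_A1 - p[(1, 1)]) < 1e-12
```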

Relating the Marginal, Conditional and Joint

Suppose we knew p(X = x ,Y = y) for all values of x , y . Could we
compute all of p(X = x |Y = y), p(X = x) and p(Y = y)? Yes.

Now suppose we knew p(X = x) and p(Y = y) for all values of x , y .
Could we compute p(X = x ,Y = y) or p(X = x |Y = y)? No.

The difference in answers above is of great significance

B = 0 B = 1
A = 0 680 10
A = 1 220 90

B = 0 B = 1
A = 0 640 50
A = 1 260 50

These have the same marginals, but different joint distributions

11 / 34
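A short sketch confirming that the two count tables above share the same marginals while their joints differ:

```python
import numpy as np

# Rows index A in {0, 1}, columns index B in {0, 1}; entries are counts / 1000.
joint1 = np.array([[680, 10], [220, 90]]) / 1000.0
joint2 = np.array([[640, 50], [260, 50]]) / 1000.0

print(joint1.sum(axis=1), joint2.sum(axis=1))   # marginals over A: both [0.69, 0.31]
print(joint1.sum(axis=0), joint2.sum(axis=0))   # marginals over B: both [0.90, 0.10]
print(np.allclose(joint1, joint2))              # False: the joint distributions differ
```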

Joint as the “Master” Distribution

In general, there can be many consistent joint distributions for a given set
of marginal distributions

The joint distribution is the “master” source of information about the
dependence

12 / 34

1 More on Joint, Marginal and Conditional Distributions

2 Statistical Independence

3 Bayes’ Theorem

4 Wrapping up

13 / 34

Recall: Fruit-Box Experiment

[Figure: a red box and a blue box, each containing a different mixture of apples and oranges (Bishop, 2006).]

14 / 34

Statistical Independence

Suppose that both boxes (red and blue) contain the same proportion of
apples and oranges.

If fruit is selected uniformly at random from each box:

p(F = a|B = r) = p(F = a|B = b) (= p(F = a))
p(F = o|B = r) = p(F = o|B = b) (= p(F = o))

The probability of selecting an apple (or an orange) is independent of the
box that is chosen.

We may study the properties of F and B separately: this often simplifies
analysis

15 / 34

Statistical Independence: Definition

Definition: Independent Variables
Two variables X and Y are statistically independent, denoted X ⊥⊥ Y , if
and only if their joint distribution factorizes into the product of their
marginals:

X ⊥⊥ Y ↔ p(X ,Y ) = p(X )p(Y )

This definition generalises to more than two variables.

Are the variables in the language example statistically independent?

16 / 34
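One way to test for independence numerically is to compare the joint against the outer product of its marginals; a sketch using the (clearly dependent) admission/brilliance table:

```python
import numpy as np

joint = np.array([[0.68, 0.01],    # p(A=a, B=b): rows index A, columns index B
                  [0.22, 0.09]])
p_A = joint.sum(axis=1)            # marginal over A
p_B = joint.sum(axis=0)            # marginal over B

product_of_marginals = np.outer(p_A, p_B)         # p(A=a) p(B=b)
print(np.allclose(joint, product_of_marginals))   # False: A and B are dependent
```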

A Note on Notation

When we write
p(X ,Y ) = p(X )p(Y )

we have not specified the outcomes of X ,Y explicitly

This statement is a shorthand for

p(X = x ,Y = y) = p(X = x)p(Y = y)

for every possible x and y

This notation is sometimes called implied universality

17 / 34

Conditional independence

We may also consider random variables that are conditionally independent
given some other variable

Definition: Conditionally Independent Variables
Two variables X and Y are conditionally independent given Z , denoted
X ⊥⊥ Y |Z , if and only if

p(X ,Y |Z ) = p(X |Z )p(Y |Z )

Intuitively, Z is a common cause for X and Y

Example: X = whether I have a cold
Y = whether I have a headache
Z = whether I have the flu

18 / 34
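A small sketch with made-up numbers (not from the slides) illustrating variables that are conditionally independent given Z yet marginally dependent, in the spirit of the cold/headache/flu example:

```python
import numpy as np

# Hypothetical numbers: Z = flu, X = cold, Y = headache.
p_Z = np.array([0.9, 0.1])                   # p(Z=0), p(Z=1)
p_X_given_Z = np.array([[0.9, 0.1],          # rows z, columns x: p(X=x | Z=z)
                        [0.2, 0.8]])
p_Y_given_Z = np.array([[0.95, 0.05],        # rows z, columns y: p(Y=y | Z=z)
                        [0.3, 0.7]])

# Build the full joint p(Z=z, X=x, Y=y), assuming X and Y are independent given Z.
joint = p_Z[:, None, None] * p_X_given_Z[:, :, None] * p_Y_given_Z[:, None, :]

# Conditionally independent: p(X, Y | Z=z) factorises for each z.
for z in (0, 1):
    cond = joint[z] / joint[z].sum()
    assert np.allclose(cond, np.outer(cond.sum(axis=1), cond.sum(axis=0)))

# ...but marginally dependent: p(X, Y) != p(X) p(Y).
p_XY = joint.sum(axis=0)
print(np.allclose(p_XY, np.outer(p_XY.sum(axis=1), p_XY.sum(axis=0))))  # False
```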

1 More on Joint, Marginal and Conditional Distributions

2 Statistical Independence

3 Bayes’ Theorem

4 Wrapping up

19 / 34

Revisiting the Product Rule

The product rule tells us:

p(X ,Y ) = p(Y |X )p(X )

This can equivalently be interpreted as a definition of conditional
probability:

p(Y | X) = p(X, Y) / p(X)

Can we use these to relate p(X |Y ) and p(Y |X )?

20 / 34

Posterior Inference:
Example 1 (Mackay, 2003)

Dicksy Sick had a test for a rare disease
▶ Only 1% of people of Dicksy’s background have the disease

The test simply classifies a person as having the disease, or not

The test is reliable, but not infallible
▶ It correctly identifies a sick individual 95% of the time:
  p(identifies sick | sick) = 95%.
▶ It correctly identifies a healthy individual 96% of the time:
  p(identifies healthy | healthy) = 96%.

Dicksy has tested positive (apparently sick)

What is the probability of Dicksy having the disease?

21 / 34

Posterior Inference:
Example 1: Formalization

Let D ∈ {0, 1} denote whether Dicksy has the disease, and T ∈ {0, 1} the
outcome of the test:

p(D = 1) = 0.01 p(D = 0) = 0.99

p(T = 1|D = 1) = 0.95 p(T = 1|D = 0) = 0.04
p(T = 0|D = 1) = 0.05 p(T = 0|D = 0) = 0.96

We need to compute p(D = 1|T = 1), the probability of Dicksy having
the disease given that the test has resulted positive.

22 / 34


Posterior Inference:
Example 1: Solution

p(D = 1 | T = 1) = p(D = 1, T = 1) / p(T = 1)                                  (Def. conditional prob.)
                 = p(T = 1, D = 1) / p(T = 1)                                  (Symmetry)
                 = p(T = 1 | D = 1) p(D = 1) / p(T = 1)                        (Product rule)
                 = p(T = 1 | D = 1) p(D = 1) / ∑_d p(T = 1 | D = d) p(D = d)   (Sum rule)
                 = p(T = 1 | D = 1) p(D = 1) / [p(T = 1 | D = 1) p(D = 1) + p(T = 1 | D = 0) p(D = 0)]
                 ≈ 0.19.

Despite testing positive and the high accuracy of the test, the probability
of Dicksy having the disease is only 0.19!

23 / 34
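The same posterior can be computed in a few lines; a sketch using the numbers from the formalization:

```python
# Sketch: posterior probability of disease given a positive test (Example 1).
p_D1 = 0.01                      # prior p(D=1)
p_T1_given_D1 = 0.95             # p(T=1 | D=1): test detects a sick person
p_T1_given_D0 = 0.04             # p(T=1 | D=0): test wrongly flags a healthy person

p_T1 = p_T1_given_D1 * p_D1 + p_T1_given_D0 * (1 - p_D1)   # evidence p(T=1)
posterior = p_T1_given_D1 * p_D1 / p_T1                    # Bayes' rule
print(posterior)                 # ≈ 0.19
```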

Why is the Probability So Low?
A “Natural Frequency” Approach

In 100 people, only 1 is expected to have the disease (p(D = 1) = 0.01)

This sick person will most likely test positive (p(T = 1|D = 1) = 0.95)

But around 4 healthy people are expected to be wrongly flagged as sick
(p(T = 1|D = 0) = 0.04, and 0.04 × 99 ≈ 4)

So when the test is positive, the chance of being sick is ≈ 1/5

(Aside: If you can correctly perform the calculation on the previous slide, you are doing better than
most medical doctors! See Gerd Gigerenzer and Adrian Edwards, Simple tools for understanding risks:
from innumeracy to insight, British Medical Journal, 327(7417), 741–744, 27 September 2003; Gerd
Gigerenzer, Reckoning with Risk: Learning to Live with Uncertainty, Penguin, 2002.

Moral of the story — if you get sick, don’t delegate conditional probability computations to your doctor!)

24 / 34
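The natural-frequency argument can also be checked with expected counts; a sketch assuming a population of 10,000 so that the numbers come out (nearly) whole:

```python
# Sketch: the natural-frequency view of Example 1, for a population of 10,000.
population = 10_000
sick = int(0.01 * population)                 # 100 people have the disease
healthy = population - sick                   # 9,900 do not

true_positives = 0.95 * sick                  # 95 sick people test positive
false_positives = 0.04 * healthy              # 396 healthy people test positive

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(p_sick_given_positive)                  # ≈ 0.19, i.e. roughly 1 in 5
```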

Bayes’ Theorem

We have implicitly used the following (at first glance remarkable) fact:

Bayes’ Theorem:

p(Z | X) = p(Z, X) / p(X)
         = p(X, Z) / p(X)
         = p(X | Z) p(Z) / p(X)
         = p(X | Z) p(Z) / ∑_{Z'} p(X | Z') p(Z')

If we can express what knowledge of X (test) tells us about Z (disease),
then we can express what knowledge of Z tells us about X

25 / 34

The Bayesian Inference Framework

Bayesian Inference
Bayesian inference provides a mathematical framework explaining how to
change our (prior) beliefs in the light of new evidence.

posterior = likelihood × prior / evidence

p(Z | X) = p(X | Z) p(Z) / p(X)

Prior: Belief that someone is sick

Likelihood: Probability of testing positive given you are sick

Posterior: Probability of being sick given you test positive

26 / 34

Posterior Inference:
Example 2 (Bishop, 2006)

Recall our fruit-box example:

The proportions of oranges and apples in each box are as in the earlier fruit-box figure.

Someone told us that in a previous experiment they ended up picking
up the red box 40% of the time and the blue box 60% of the time.

A piece of fruit has been picked up and it turned out to be an orange.

What is the probability that it came from the red box?

27 / 34

Posterior Inference:
Example 2: Formalization

Let B ∈ {r , b} denote the selected box and F ∈ {a, o} the selected fruit.

p(B = r) = 4/10 p(B = b) = 6/10

p(F = a|B = r) = 1/4 p(F = o|B = r) = 3/4
p(F = a|B = b) = 3/4 p(F = o|B = b) = 1/4

We need to compute p(B = r |F = o), the probability that the picked-up
orange came from the red box.

28 / 34

Posterior Inference:
Example 2: Solution

We simply use Bayes’ rule:

p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
                 = p(F = o | B = r) p(B = r) / [p(F = o | B = r) p(B = r) + p(F = o | B = b) p(B = b)]
                 = 2/3

and therefore p(B = b|F = o) = 1/3.

29 / 34
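As before, a short sketch of the same calculation using the probabilities from the formalization:

```python
# Sketch: posterior probability that the orange came from the red box (Example 2).
p_r, p_b = 0.4, 0.6                    # priors p(B=r), p(B=b)
p_o_given_r = 3 / 4                    # p(F=o | B=r)
p_o_given_b = 1 / 4                    # p(F=o | B=b)

p_o = p_o_given_r * p_r + p_o_given_b * p_b          # evidence p(F=o)
posterior_red = p_o_given_r * p_r / p_o              # Bayes' rule
print(posterior_red, 1 - posterior_red)              # 2/3 and 1/3
```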

Posterior Inference:
Example 2: Interpretation of the Solution

If we hadn’t been told any information about the fruit picked, the blue
box is more likely to have been selected than the red box
▶ A priori we have p(B = r) = 4/10 and p(B = b) = 6/10

Once we get the new information that an orange has been picked, this
increases the probability that the selected box was the red one
▶ Because the red box contains a higher proportion of oranges than the blue box

In fact, the proportion of oranges is so much higher in the red box that
this is strong evidence that the orange came from it
▶ So after picking up the orange, the red box is much more likely to have
  been selected than the blue one

30 / 34

1 More on Joint, Marginal and Conditional Distributions

2 Statistical Independence

3 Bayes’ Theorem

4 Wrapping up

31 / 34

Summary

Recap on joint, marginal and conditional distributions

Interpretation of conditional probability

Statistical Independence

Bayes’ rule: combining the prior and the likelihood to obtain a posterior

Reading: MacKay §2.1, §2.2 and §2.3

32 / 34

Homework Exercise

Suppose we know that random variables X ,Y satisfy

p(X |Y ) = p(Y |X )

What can you conclude about the relationship between X and Y?

If X and Y are independent, does that imply p(X |Y ) = p(Y |X )?

Repeat the above questions for the statement

p(X | Y) / p(Y | X) = p(X) / p(Y)

33 / 34

Next time

More examples on Bayes’ theorem:
▶ Eating hamburgers
▶ Detecting terrorists
▶ The Monty Hall problem
▶ Document modelling

Are there notions of probability beyond frequency counting?

34 / 34
