COMP2610 / COMP6261 – Information Theory Lecture 3: Probability Theory and Bayes’ Rule
Robert C. Williamson
Research School of Computer Science
July 30, 2018
A general communication system
Why do we need probability?
Basics of probability theory
Joint, marginal and conditional distributions
Review Exercise
Suppose I go through the records for N = 1000 students, checking their admission status, A ∈ {0, 1}, and whether they are “brilliant” or not, B ∈ {0, 1}.
(Aside: “Brilliance” is a dodgy concept, and does not predict scientific achievement as well as persistence and combinatorial ability; see e.g. Simonton, Scientific Genius: A Psychology of Science, Cambridge University Press, 2009; this is just a toy example!)
Say that the counts for admission and brilliance are

          B = 0   B = 1
  A = 0     680      10
  A = 1     220      90

Then:
  p(A = 1, B = 0) = 220/1000   (Joint)
  p(B = 1)        = 100/1000   (Marginal)
  p(A = 0)        = 690/1000   (Marginal)
  p(B = 1|A = 1)  =  90/310    (Conditional)
  p(A = 0|B = 0)  = 680/900    (Conditional)
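As a sanity check, these five quantities can be recomputed mechanically from the count table. A minimal Python/NumPy sketch (the array layout and variable names are my own, not from the slides):

import numpy as np

# Count table: rows index A (admission), columns index B (brilliance).
counts = np.array([[680, 10],    # A = 0
                   [220, 90]])   # A = 1
N = counts.sum()                 # 1000

joint = counts / N                          # p(A = a, B = b)
p_A = joint.sum(axis=1)                     # marginal p(A = a), sum rule
p_B = joint.sum(axis=0)                     # marginal p(B = b), sum rule

print(joint[1, 0])                          # p(A = 1, B = 0) = 0.22
print(p_B[1])                               # p(B = 1)        = 0.10
print(p_A[0])                               # p(A = 0)        = 0.69
print(counts[1, 1] / counts[1, :].sum())    # p(B = 1|A = 1)  = 90/310
print(counts[0, 0] / counts[:, 0].sum())    # p(A = 0|B = 0)  = 680/900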
More on joint, marginal and conditional distributions
When can we say that X, Y do not influence each other?
What, if anything, does p(X = x|Y = y) tell us about p(Y = y|X = x)?
Philosophically related to “How do we know / learn about the world?”
I am not providing a general answer; but keep it in mind!
1 More on Joint, Marginal and Conditional Distributions
2 Statistical Independence
3 Bayes’ Theorem
4 Wrapping up
Document Modelling Example
Suppose we have a large document of English text, represented as a sequence of characters:
x_1 x_2 x_3 … x_N, e.g. hello how are you
Treat each consecutive pair of characters as the outcome of “random variables” X, Y, i.e.
X = ‘h’, Y = ‘e’
X = ‘e’, Y = ‘l’
X = ‘l’, Y = ‘l’
…
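To make this concrete, here is a minimal sketch of forming empirical joint, marginal and conditional distributions from character pairs (the toy string and all names are my own; this is not MacKay’s exact procedure):

from collections import Counter

text = "hello how are you"                 # stand-in for a large document
pairs = list(zip(text, text[1:]))          # consecutive character pairs (x, y)
N = len(pairs)

joint = {xy: c / N for xy, c in Counter(pairs).items()}   # p(X = x, Y = y)

p_x = Counter()
for (x, _), p in joint.items():
    p_x[x] += p                                           # p(X = x), sum rule

def p_y_given_x(y, x):
    """Conditional p(Y = y | X = x) via the product rule."""
    return joint.get((x, y), 0.0) / p_x[x]

print(p_y_given_x('e', 'h'))               # p(Y = 'e' | X = 'h') = 0.5 here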
Document Modelling: Marginal and Joint Distributions
[Figure: unigram (marginal) and bigram (joint) distributions for the English alphabet, estimated from the “FAQ manual for Linux”. From MacKay (ITILA, 2003); areas of squares proportional to probability (the right way to do it!).]
Document Modelling: Conditional Distributions
Conditional distributions for the English alphabet, estimated from the “FAQ manual for Linux”. Are these distributions “symmetric”? Figure from MacKay (ITILA, 2003).
Is P(X = x|Y = y) = P(Y = y|X = x)?
Is P(X = x|Y = y) = P(X = y|Y = x)?
Recap: Sum and Product Rules
Sum rule:
p(X = x_i) = Σ_j p(X = x_i, Y = y_j)
Product rule:
p(X = x_i, Y = y_j) = p(Y = y_j|X = x_i) p(X = x_i)
Relating the Marginal, Conditional and Joint
Suppose we knew p(X = x, Y = y) for all values of x, y. Could we compute all of p(X = x|Y = y), p(X = x) and p(Y = y)? Yes.
Now suppose we knew p(X = x) and p(Y = y) for all values of x, y. Could we compute p(X = x, Y = y) or p(X = x|Y = y)? No.
The difference in answers above is of great significance:

          B = 0   B = 1              B = 0   B = 1
  A = 0     680      10      A = 0     640      50
  A = 1     220      90      A = 1     260      50

These have the same marginals, but different joint distributions.
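The two tables above are easy to check numerically; a small sketch, assuming the same row/column layout as before:

import numpy as np

J1 = np.array([[680, 10], [220, 90]])      # first count table
J2 = np.array([[640, 50], [260, 50]])      # second count table

for J in (J1, J2):
    # The marginals agree across the two tables...
    print(J.sum(axis=1), J.sum(axis=0))    # [690 310] [900 100] both times
    # ...but the conditionals (hence the joints) differ:
    print(J[1, 1] / J[1, :].sum())         # p(B = 1|A = 1): 90/310 vs 50/310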
Joint as the “Master” Distribution
In general, many joint distributions are consistent with a given set of marginal distributions
The joint distribution is the “master” source of information about the dependence between the variables: marginals and conditionals can be derived from it, but not vice versa
1 More on Joint, Marginal and Conditional Distributions
2 Statistical Independence
3 Bayes’ Theorem
4 Wrapping up
Recall: Fruit-Box Experiment
Statistical Independence
Suppose that both boxes (red and blue) contain the same proportion of apples and oranges.
If fruit is selected uniformly at random from each box:
p(F = a|B = r) = p(F = a|B = b) (= p(F = a))
p(F = o|B = r) = p(F = o|B = b) (= p(F = o))
The probability of selecting an apple (or an orange) is independent of the box that is chosen.
We may study the properties of F and B separately: this often simplifies analysis
Statistical Independence: Definition
Definition: Independent Variables
Two variables X and Y are statistically independent, denoted X ⊥ Y, if and only if their joint distribution factorizes into the product of their marginals:
X ⊥ Y ↔ p(X, Y) = p(X)p(Y)
This definition generalises to more than two variables.
Are the variables in the language example statistically independent?
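One way to answer such questions numerically is to compare the joint with the product of its marginals. A minimal sketch (the function name is my own); applied to the bigram table of the language example, the same check would answer the question above:

import numpy as np

def is_independent(joint, tol=1e-9):
    """Test X ⊥ Y by checking p(X, Y) = p(X) p(Y) elementwise."""
    p_x = joint.sum(axis=1, keepdims=True)   # marginal p(X)
    p_y = joint.sum(axis=0, keepdims=True)   # marginal p(Y)
    return np.allclose(joint, p_x * p_y, atol=tol)

# The admissions table is clearly not independent:
print(is_independent(np.array([[680, 10], [220, 90]]) / 1000))   # False

# A joint built as the outer product of marginals is independent:
print(is_independent(np.outer([0.69, 0.31], [0.9, 0.1])))        # True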
A Note on Notation
When we write
p(X, Y) = p(X)p(Y)
we have not specified the outcomes of X, Y explicitly
This statement is a shorthand for
p(X = x, Y = y) = p(X = x)p(Y = y) for every possible x and y
This notation is sometimes called implied universality
Conditional independence
We may also consider random variables that are conditionally independent
given some other variable
Definition: Conditionally Independent Variables
Two variables X and Y are conditionally independent given Z, denoted X ⊥ Y|Z, if and only if
p(X, Y|Z) = p(X|Z)p(Y|Z)
Intuitively, Z is a common cause for X and Y
Example:
X = whether I have a cold
Y = whether I have a headache
Z = whether I have the flu
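A small numeric illustration of the cold/headache/flu example. The probabilities below are made up (mine, not from the slides), chosen only to show that X ⊥ Y|Z can hold while X and Y remain marginally dependent:

# Hypothetical numbers: Z = flu, X = cold, Y = headache.
p_z1 = 0.1                                  # p(Z = 1)
p_x1_given_z = {1: 0.9, 0: 0.1}             # p(X = 1|Z = z)
p_y1_given_z = {1: 0.8, 0: 0.2}             # p(Y = 1|Z = z)

def p_xyz(x, y, z):
    """Joint built so that X ⊥ Y | Z holds by construction."""
    pz = p_z1 if z == 1 else 1 - p_z1
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_z[z] if y == 1 else 1 - p_y1_given_z[z]
    return pz * px * py

# Marginally, X and Y are NOT independent:
p_x1 = sum(p_xyz(1, y, z) for y in (0, 1) for z in (0, 1))   # 0.18
p_y1 = sum(p_xyz(x, 1, z) for x in (0, 1) for z in (0, 1))   # 0.26
p_x1y1 = sum(p_xyz(1, 1, z) for z in (0, 1))                 # 0.09
print(p_x1y1, p_x1 * p_y1)   # 0.09 vs 0.0468: the common cause Z couples them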
1 More on Joint, Marginal and Conditional Distributions
2 Statistical Independence
3 Bayes’ Theorem
4 Wrapping up
Revisiting the Product Rule
The product rule tells us:
p(X,Y) = p(Y|X)p(X)
This can equivalently be interpreted as a definition of conditional probability:
p(Y|X) = p(X, Y) / p(X)
Can we use these to relate p(X|Y) and p(Y|X)?
Posterior Inference: Example 1 (MacKay, 2003)
Dicksy Sick had a test for a rare disease
Only 1% of people of Dicksy’s background have the disease
The test simply classifies a person as having the disease, or not
The test is reliable, but not infallible:
It correctly identifies a sick individual 95% of the time: p(identifies sick | sick) = 95%.
It correctly identifies a healthy individual 96% of the time: p(identifies healthy | healthy) = 96%.
Dicksy has tested positive (apparently sick)
What is the probability of Dicksy having the disease?
Posterior Inference: Example 1: Formalization
Let D ∈ {0, 1} denote whether Dicksy has the disease, and T ∈ {0, 1} the outcome of the test:

p(D = 1) = 0.01    p(T = 1|D = 1) = 0.95    p(T = 0|D = 1) = 0.05
p(D = 0) = 0.99    p(T = 1|D = 0) = 0.04    p(T = 0|D = 0) = 0.96

We need to compute p(D = 1|T = 1), the probability of Dicksy having the disease given that the test has come back positive.
Posterior Inference: Example 1: Solution

p(D = 1|T = 1)
  = p(D = 1, T = 1) / p(T = 1)                                        [def. conditional prob.]
  = p(T = 1, D = 1) / p(T = 1)
  = p(T = 1|D = 1) p(D = 1) / p(T = 1)                                [product rule]
  = p(T = 1|D = 1) p(D = 1) / Σ_d p(T = 1|D = d) p(D = d)             [sum rule]
  = p(T = 1|D = 1) p(D = 1) / [p(T = 1|D = 1) p(D = 1) + p(T = 1|D = 0) p(D = 0)]
  = (0.95 × 0.01) / (0.95 × 0.01 + 0.04 × 0.99)
  ≈ 0.19.

Despite testing positive and the high accuracy of the test, the probability of Dicksy having the disease is only about 0.19!
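The arithmetic is easy to verify; a short Python check using the numbers from the formalization:

# Posterior p(D = 1|T = 1) via Bayes' theorem, numbers from the slide.
p_d1 = 0.01                       # prior p(D = 1)
p_t1_d1, p_t1_d0 = 0.95, 0.04     # likelihoods p(T = 1|D = 1), p(T = 1|D = 0)

posterior = p_t1_d1 * p_d1 / (p_t1_d1 * p_d1 + p_t1_d0 * (1 - p_d1))
print(posterior)                  # 0.1934... ≈ 0.19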