COMP111: Artificial Intelligence
Section 8. Reasoning under Uncertainty 1
Frank Wolter
Content
- Why reasoning under uncertainty?
- Basics of probability
- Conditional probability
- Independence
- Bayes’ Theorem and basic applications
Why Reasoning under Uncertainty?
Logic-based KR&R methods mostly assume that knowledge is
certain. Often, this is not the case (or it is impossible to list all
assumptions that make it certain).
- When going to the airport by car, how early should I start? 45 minutes should be enough from Liverpool to Manchester Airport, but only under the assumption that there are no accidents, no lane closures, that my car does not break down, and so on. This uncertainty is hard to eliminate, but an agent still has to make a decision.
- A dental patient has toothache. Does the patient have a cavity? One might want to capture the relationship between patients having a cavity and patients having toothache by the rule:

    Toothache(x) → Cavity(x)

  But this does not work, as not every toothache is caused by a cavity. It is hard to come up with an exhaustive list of reasons:

    Toothache(x) → Cavity(x) ∨ GumProblem(x) ∨ Abscess(x) ∨ · · ·
Uncertainty
Trying to use exact rules to cope with a domain like medical
diagnosis or traffic fails for three main reasons:
- Laziness: it is too much work to list an exceptionless set of rules.
- Theoretical ignorance: medical science has, in many cases, no strict laws connecting symptoms with diseases.
- Practical ignorance: even if we have strict laws, we might be uncertain about a particular patient because not all the necessary tests have been or can be run.
Probability as Summary
- Probability provides a way of summarizing the uncertainty that comes from our laziness and ignorance.
- We might not know for sure what disease a particular patient has, but we believe that there is, say, an 80% chance that a patient with toothache has a cavity. The 80% summarises those cases in which all the factors needed for a cavity to cause a toothache are present and other cases in which the patient has both toothache and cavity but the two are unconnected.
- The missing 20% summarizes all the other possible causes of toothache that we are too lazy or ignorant to confirm or deny.
Discrete Probability
We represent random experiments using discrete probability spaces (S, P) consisting of:
- the sample space S of all elementary events x ∈ S; members of S are also called outcomes of the experiment;
- a probability distribution P assigning a real number P(x) to every elementary event x ∈ S such that
  - for every x ∈ S: 0 ≤ P(x) ≤ 1;
  - and ∑_{x∈S} P(x) = 1.

Recall that if S consists of x1, . . . , xn, then ∑_{x∈S} P(x) = P(x1) + · · · + P(xn).
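
As a small illustration (a minimal sketch, not part of the original slides), a discrete probability space can be represented in Python as a dictionary mapping outcomes to probabilities; the helper below just checks the two defining conditions.

  # Minimal sketch (not from the slides): a probability space (S, P) as a dict
  # from outcomes to probabilities, plus a check of the two defining conditions.
  from math import isclose

  def is_probability_space(P):
      """0 <= P(x) <= 1 for every outcome x, and the probabilities sum to 1."""
      return all(0.0 <= p <= 1.0 for p in P.values()) and isclose(sum(P.values()), 1.0)

  coin = {"H": 0.5, "T": 0.5}          # fair coin: S = {H, T}, P(H) = P(T) = 1/2
  print(is_probability_space(coin))    # True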
Example: Flipping a fair coin
Consider the random experiment of flipping a coin. Then the corresponding probability space (S, P) is given by
- S = {H, T};
- P(H) = P(T) = 1/2.

Consider the random experiment of flipping a coin two times, one after the other. Then the corresponding probability space (S, P) is given as follows:
- S = {HH, HT, TH, TT};
- P(HH) = P(HT) = P(TH) = P(TT) = 1/4.
Example: Rolling a fair die
Consider the random experiment of rolling a die. Then the corresponding probability space (S, P) is given by
- S = {1, 2, 3, 4, 5, 6};
- for every x ∈ S: P(x) = 1/6.

Consider the random experiment of rolling a die n times. Then the corresponding probability space (S, P) is given as follows:
- S is the set of sequences of length n over the alphabet {1, . . . , 6} (sometimes denoted {1, . . . , 6}^n);
- P(x) = 1/6^n for every elementary event x, since S has 6^n elements.
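
A quick sketch (my own, not in the slides): the sample space {1, . . . , 6}^n can be enumerated directly, which also confirms that it has 6^n elements and that each outcome has uniform probability 1/6^n.

  # Sketch (assumed example, not from the slides): enumerate {1, ..., 6}^n.
  from itertools import product

  n = 3
  S = list(product(range(1, 7), repeat=n))   # all sequences of n die rolls
  print(len(S))        # 216 = 6**n
  print(1 / len(S))    # uniform probability P(x) of each elementary event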
Uniform Probability Distributions
- A probability distribution is uniform if every outcome is equally likely.
- For uniform probability distributions, the probability of an outcome x is 1 divided by the number |S| of outcomes in S:

    P(x) = 1/|S|
Events
[Diagram: an event E drawn as a subset of the sample space S]

- An event is a subset E ⊆ S of the sample space S.
- The probability of the event E is given by

    P(E) = ∑_{x∈E} P(x)

Notice that
- 0 ≤ P(E) ≤ 1 for every event E,
- P(∅) = 0 and P(S) = 1.
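
A small sketch (helper names are my own, not from the slides): P(E) computed as the sum of the probabilities of the outcomes in E.

  # Sketch (not from the slides): P(E) as a sum over the outcomes in E.
  from fractions import Fraction

  def prob(event, P):
      """P(E) = sum of P(x) for x in E, where P maps each outcome to its probability."""
      return sum(P[x] for x in event)

  die = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die
  odd = {1, 3, 5}                                  # event "roll an odd number"
  print(prob(odd, die))                            # 1/2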
Example
If I roll a die three times, the event E of rolling at least one 6 is given by
- the set of sequences of length 3 over {1, . . . , 6} containing at least one 6;
- P(E) is the number of sequences containing at least one 6 divided by 6 × 6 × 6 = 216.

If we roll a fair die, then the event E of rolling an odd number is given by
- the set E = {1, 3, 5};
- P(E) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2.
The probability of composed events
We show how the probability of
- the complement of an event can be computed from the probability of the event;
- the union of events can be computed from the probabilities of the individual events.
The probability of the complement of an event
Let ¬E = S \ E. Then P(¬E) = 1 − P(E).

[Diagram: E and its complement ¬E inside the sample space S]

Side remark: S = ¬E ∪ E.

Proof.

  1 = ∑_{x∈S} P(x) = ∑_{x∈E} P(x) + ∑_{x∈¬E} P(x)

Thus,

  ∑_{x∈¬E} P(x) = 1 − ∑_{x∈E} P(x)
Example
What is the probability that at least one bit in a randomly generated sequence of 10 bits is 0?
- S = {0, 1}^10 = all sequences of 0 and 1 of length 10.
- For every x ∈ S, P(x) = (1/2)^10 = 1/2^10.
- E = all sequences of 0 and 1 of length 10 containing at least one 0.
- ¬E = {1111111111}.
- P(¬E) = 1/2^10.
- P(E) = 1 − 1/2^10.
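
A sketch of the same calculation (my own check, not from the slides), verifying the complement rule by enumerating all 10-bit sequences:

  # Sketch: P(at least one 0) = 1 - P(no 0) over {0,1}^10.
  from itertools import product

  S = list(product("01", repeat=10))      # 2**10 equally likely outcomes
  E = [x for x in S if "0" in x]          # at least one 0
  print(len(E) / len(S))                  # 0.9990234375 = 1023/1024
  print(1 - 1 / 2**10)                    # same value via the complement rule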
The probability of the union of two events
P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)

[Diagram: overlapping events E1 and E2 inside the sample space S]

Side remark: |E1 ∪ E2| = |E1| + |E2| − |E1 ∩ E2|
Proof of P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)

Recall:
- P(E1) = ∑_{x∈E1} P(x);
- P(E2) = ∑_{x∈E2} P(x);
- P(E1 ∪ E2) = ∑_{x∈E1∪E2} P(x).

Thus,

  P(E1 ∪ E2) = ∑_{x∈E1∪E2} P(x)
             = ∑_{x∈E1} P(x) + ∑_{x∈E2} P(x) − ∑_{x∈E1∩E2} P(x)
             = P(E1) + P(E2) − P(E1 ∩ E2)
Example
Suppose I have a jar of 30 sweets, as follows.

             red   blue   green
  circular    2     4      3
  square      6     7      8

The sample space S has 30 elements, and if one chooses a sweet uniformly at random, then P(x) = 1/30 for all x ∈ S. What is the probability of choosing a red or circular sweet?
- The probability that it is red is (2+6)/30 = 8/30, i.e. P(R) = 8/30.
- The probability that it is circular is (2+4+3)/30 = 9/30, i.e. P(C) = 9/30.

Then P(R ∪ C) is the probability that the sweet is red or circular. Thus

  P(R ∪ C) = P(R) + P(C) − P(R ∩ C) = 8/30 + 9/30 − 2/30 = 15/30 = 1/2
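
As a small cross-check (my own example code, not from the slides), the inclusion-exclusion step can be reproduced by modelling each sweet as a (shape, colour) pair:

  # Sketch: inclusion-exclusion on the sweet jar, P(R ∪ C) = P(R) + P(C) - P(R ∩ C).
  from fractions import Fraction

  sweets = ([("circular", "red")] * 2 + [("circular", "blue")] * 4 + [("circular", "green")] * 3 +
            [("square", "red")] * 6 + [("square", "blue")] * 7 + [("square", "green")] * 8)
  n = len(sweets)                                                               # 30

  p_red  = Fraction(sum(c == "red" for _, c in sweets), n)                      # 8/30
  p_circ = Fraction(sum(s == "circular" for s, _ in sweets), n)                 # 9/30
  p_both = Fraction(sum(s == "circular" and c == "red" for s, c in sweets), n)  # 2/30
  print(p_red + p_circ - p_both)                                                # 1/2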
Disjoint events
Assume that E1, . . . , En are mutually disjoint events, i.e. Ei ∩ Ej = ∅ whenever i ≠ j. Then

  P(⋃_{1≤i≤n} Ei) = ∑_{1≤i≤n} P(Ei)
Example: three dice
Suppose that I roll a fair die three times. Then
- S is the set of sequences of length three over {1, . . . , 6} (or {1, . . . , 6}^3);
- P(x) = 1/(6 × 6 × 6) = 1/216 for all x ∈ S.

What is the probability that I roll at least one 6?

Let E1 be the event that the 1st roll is a 6, E2 the event that the 2nd roll is a 6, and E3 the event that the 3rd roll is a 6.

We would like to know P(E1 ∪ E2 ∪ E3).
Computing the probability of a union of three sets
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
             − P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
             + P(A ∩ B ∩ C).

[Diagram: three overlapping events A, B and C]

|A ∪ B ∪ C| = |A| + |B| + |C|
            − |A ∩ B| − |A ∩ C| − |B ∩ C|
            + |A ∩ B ∩ C|.
Three dice example continued
[Diagram: three overlapping events E1, E2 and E3]

E1: event that the 1st roll is a 6.
E2: event that the 2nd roll is a 6.
E3: event that the 3rd roll is a 6.

  P(E1 ∪ E2 ∪ E3) = P(E1) + P(E2) + P(E3)
                  − P(E1 ∩ E2) − P(E1 ∩ E3) − P(E2 ∩ E3)
                  + P(E1 ∩ E2 ∩ E3)
                  = 36/216 + 36/216 + 36/216 − 6/216 − 6/216 − 6/216 + 1/216
                  = 91/216 ≈ 0.42
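
A short verification (my own, not in the slides): counting the sequences with at least one 6 directly gives the same 91/216.

  # Sketch: verify 91/216 by enumerating all sequences of three die rolls.
  from itertools import product
  from fractions import Fraction

  S = list(product(range(1, 7), repeat=3))
  at_least_one_six = [x for x in S if 6 in x]
  print(Fraction(len(at_least_one_six), len(S)))   # 91/216
  print(len(at_least_one_six) / len(S))            # ≈ 0.4213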
Conditional Probability
- Often, we are interested in just part of the sample space.
- Conditional probability gives us a means of handling this situation.
- Consider a family chosen at random from a set of families having two children (but not having twins).
- What is the probability that both children are boys?
- A suitable sample space is S = {BB, GB, BG, GG}.
- It is reasonable to assume that P(x) = 1/4 for all x ∈ S. Thus P(BB) = 1/4.
Conditional Probability
- Now you learn that the families were selected from those who have one child at a boys’ school. Does this change the probabilities?
- The new sample space (denoted S′) is S′ = {BB, GB, BG}, and we are now looking for

    P(BB | at least one boy) = P(BB | S′),

  where the vertical line is read “given that”.
- How do we assign probabilities to the events in S′?
Normalisation
- S′ is a subset of S, so every outcome x in S′ is also in S, and we can determine its probability P(x) in S.
- However, if we just take the sum of these probabilities, they will sum to less than 1 (in the example, P(BB) + P(GB) + P(BG) = 3/4). This violates our assumptions for probability spaces.
- We therefore normalise by dividing the probability P(x) of the outcome x in S by the probability P(S′) of S′ in S. Thus

    P(BB | at least one boy) = P(BB | S′) = P(BB) / P(S′) = (1/4) / (3/4) = 1/3
Conditioning
- Assume now that events A and B are given.
- Assume we know that B happens, so we want to condition on B. Thus, we want to know P(A | B), the probability that A occurs given that B is known to occur.
- After conditioning on B, the relevant probability is P(A ∩ B), since we know that B occurs.
- Once again, we cannot take P(A ∩ B) itself but have to normalise by dividing by the probability P(B) of the new sample space. Thus

    P(A | B) = P(A ∩ B) / P(B)
Conditional Probability
Let A and B be events, with P(B) > 0. The conditional probability P(A|B) of A given B is given by

  P(A|B) = P(A ∩ B) / P(B)
Example
I choose a sweet uniformly at random from a jar of 30 sweets.

             red   blue   green
  circular    2     4      3
  square      6     7      8

- The probability that it is red conditioned on the event that it is circular is

    (2/30) ÷ ((2+4+3)/30) = (2/30) × (30/(2+4+3)) = 2/(2+4+3) = 2/9,

  which is less than the unconditional probability that it is red, (2+6)/30 = 8/30.
- The probability that it is circular conditioned on the event that it is red is

    (2/30) ÷ ((2+6)/30) = 2/(6+2) = 1/4
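
A small sketch of the same computation (the helper and predicate names are my own, not the lecturer's), using P(A | B) = P(A ∩ B) / P(B) over the sweet-jar sample space:

  # Sketch: conditional probability over the sweet jar from the table above.
  from fractions import Fraction

  sweets = ([("circular", "red")] * 2 + [("circular", "blue")] * 4 + [("circular", "green")] * 3 +
            [("square", "red")] * 6 + [("square", "blue")] * 7 + [("square", "green")] * 8)
  n = len(sweets)

  def cond_prob(A, B):
      """P(A | B) = P(A ∩ B) / P(B), with A and B given as predicates over outcomes."""
      p_B  = Fraction(sum(B(x) for x in sweets), n)
      p_AB = Fraction(sum(A(x) and B(x) for x in sweets), n)
      return p_AB / p_B

  print(cond_prob(lambda s: s[1] == "red", lambda s: s[0] == "circular"))   # 2/9
  print(cond_prob(lambda s: s[0] == "circular", lambda s: s[1] == "red"))   # 1/4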
Multiplication Rule
- The conditional probability P(A|B) of A given B is given by

    P(A|B) = P(A ∩ B) / P(B)

- We can rewrite this as

    P(A ∩ B) = P(A | B)P(B)   (also: P(A ∩ B) = P(B | A)P(A))

  This is known as the multiplication rule.
- It can be extended to more events, for example:

    P(A ∩ B ∩ C) = P(C | A ∩ B)P(A ∩ B) = P(C | A ∩ B)P(B | A)P(A)
An Application of the Multiplication Rule
- Consider choosing a family from a set of families with just one pair of twins (and thus no other children).
- What is the probability P(BB) that both twins are boys?
- Recall that twins are either identical (I) or fraternal (F). We know that a third of human twins are identical. Thus

    P(I) = 1/3,   P(F) = 2/3

  and

    P(BB) = P(I ∩ BB) + P(F ∩ BB)

- By the multiplication rule,

    P(I ∩ BB) = P(BB | I)P(I),   P(F ∩ BB) = P(BB | F)P(F)
An Application of the Multiplication Rule
- The probability of being a girl or boy for fraternal twins will be the same as for any other two-child family. For the identical twins, the outcomes BG and GB are no longer possible. Thus:

    P(BB | I) = 1/2,   P(BB | F) = 1/4

- We obtain:

    P(BB) = P(I ∩ BB) + P(F ∩ BB)
          = P(BB | I)P(I) + P(BB | F)P(F)
          = 1/2 × 1/3 + 1/4 × 2/3
          = 1/3
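
The case analysis above is just the multiplication rule plus summation over the two twin types; a sketch (my own, not from the slides):

  # Sketch: P(BB) = P(BB | I) P(I) + P(BB | F) P(F), with exact fractions.
  from fractions import Fraction

  p_I, p_F = Fraction(1, 3), Fraction(2, 3)    # prior over twin type
  p_BB_given_I = Fraction(1, 2)                # identical twins: BB or GG only
  p_BB_given_F = Fraction(1, 4)                # fraternal twins: like any two-child family

  p_BB = p_BB_given_I * p_I + p_BB_given_F * p_F
  print(p_BB)                                  # 1/3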
Independence
- In everyday language we refer to events that “have nothing to do with each other” as being independent.
- A similar notion of independence is useful in probability theory, as it helps to structure probabilistic knowledge and reduce complexity.

Events A and B are independent if

  P(A ∩ B) = P(A) × P(B)

If P(A) ≠ 0 and P(B) ≠ 0, then the following are equivalent:
- A and B are independent;
- P(B) = P(B|A)   (recall that P(B | A) = P(A ∩ B) / P(A));
- P(A) = P(A|B)   (recall that P(A | B) = P(A ∩ B) / P(B)).
Example 1: Independence
If you roll two dice, then the rolls are independent. So the event A of rolling a 1 with the first die is independent of the event B of rolling a 1 with the second die. So the probability that both of these happen is

  P(A ∩ B) = P(A) × P(B) = 1/6 × 1/6 = 1/36

This holds for all ab ∈ {1, 2, 3, 4, 5, 6}^2:

  P(ab) = 1/36 = P(a) × P(b)
Example continued
- Let A be the event of rolling a 1 with the first die.
- Let C be the event of rolling a 1 with the first or second die.
- Let B be the event of rolling an even sum.

We show that A and B are independent but C and B are not.
- As A = {11, 12, 13, 14, 15, 16}, we have P(A) = 6/36 = 1/6.
- As B = {11, 13, 15, 22, 24, 26, 31, 33, 35, 42, 44, 46, 51, 53, 55, 62, 64, 66}, we have P(B) = 18/36 = 1/2.
- We have A ∩ B = {11, 13, 15}. So

    P(A ∩ B) = 3/36 = 1/12 = P(A)P(B)

  and we conclude that A and B are independent.
Example continued
- As C = {11, 12, 13, 14, 15, 16, 21, 31, 41, 51, 61}, we have P(C) = 11/36.
- As B ∩ C = {11, 13, 15, 31, 51}, we have P(B ∩ C) = 5/36.
- So

    P(B ∩ C) = 5/36 ≠ 5.5/36 = P(B)P(C)

  and we conclude that B and C are not independent.
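
A sketch of the same checks (my own, not in the slides), enumerating the 36 equally likely outcomes of two die rolls and comparing P(X ∩ Y) with P(X)P(Y) exactly:

  # Sketch: independence checks over the two-dice sample space.
  from itertools import product
  from fractions import Fraction

  S = list(product(range(1, 7), repeat=2))

  def p(event):
      return Fraction(sum(1 for x in S if event(x)), len(S))

  A = lambda x: x[0] == 1                  # 1 on the first die
  C = lambda x: x[0] == 1 or x[1] == 1     # 1 on the first or second die
  B = lambda x: (x[0] + x[1]) % 2 == 0     # even sum

  print(p(lambda x: A(x) and B(x)) == p(A) * p(B))   # True: A and B independent
  print(p(lambda x: C(x) and B(x)) == p(C) * p(B))   # False: C and B not independent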
Example 2
- Consider the random experiment of drawing two balls, one after the other, from a basket containing a red, a blue, and a green ball.
- Let R1 be the event that the first ball is red.
- Let R2 be the event that the second ball is blue.

We show that R1 and R2 are not independent. Let
- S = {RB, RG, BR, BG, GR, GB};
- P(RB) = P(RG) = P(BR) = P(BG) = P(GR) = P(GB) = 1/6.
- Then R1 = {RB, RG} and R2 = {RB, GB}.
- P(R1) = P(RB) + P(RG) = 1/3 and P(R2) = P(RB) + P(GB) = 1/3.
- Thus,

    P(R1 ∩ R2) = P(RB) = 1/6 ≠ P(R1) × P(R2) = 1/9
Independence for more than two events
- Consider a finite set of events A1, . . . , An.
- A1, . . . , An are pairwise independent if every pair of events is independent: for all distinct k, m,

    P(Am ∩ Ak) = P(Am)P(Ak)

- A1, . . . , An are mutually independent if every event is independent of any intersection of the other events: for all distinct k1, . . . , km,

    P(Ak1) × · · · × P(Akm) = P(Ak1 ∩ · · · ∩ Akm)

- Pairwise independence is not a particularly important notion.
Pairwise independence does not imply mutual independence

- Consider the random experiment of flipping a coin two times, one after the other. Then S = {HH, HT, TH, TT} and P(HH) = P(HT) = P(TH) = P(TT) = 1/4.
- Let H1 = {HT, HH}. Then P(H1) = 1/2.
- Let H2 = {HH, TH}. Then P(H2) = 1/2.
- Let H∗ = {HH, TT}. Then P(H∗) = 1/2.
- H1, H2, H∗ are pairwise independent:

    H1 ∩ H2 = H1 ∩ H∗ = H2 ∩ H∗ = {HH},   P(HH) = 1/4

- H1, H2, H∗ are not mutually independent:

    P(H1 ∩ H2 ∩ H∗) = 1/4 ≠ 1/8 = P(H1)P(H2)P(H∗)
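
The distinction can be checked mechanically; a sketch (my own, not from the slides) using exact fractions:

  # Sketch: H1, H2, H* are pairwise independent but not mutually independent.
  from fractions import Fraction

  P = {x: Fraction(1, 4) for x in ["HH", "HT", "TH", "TT"]}

  def p(event):
      return sum(P[x] for x in event)

  H1, H2, Hstar = {"HT", "HH"}, {"HH", "TH"}, {"HH", "TT"}

  print(p(H1 & H2) == p(H1) * p(H2))                      # True
  print(p(H1 & Hstar) == p(H1) * p(Hstar))                # True
  print(p(H2 & Hstar) == p(H2) * p(Hstar))                # True
  print(p(H1 & H2 & Hstar) == p(H1) * p(H2) * p(Hstar))   # False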
Example
- Consider the random experiment of rolling a die n times. Recall that the probability space (S, P) is given as follows:
  - S is the set of sequences of length n over the alphabet {1, . . . , 6} (sometimes denoted {1, . . . , 6}^n);
  - P(x) = 1/6^n, since S has 6^n elements.
- Let Ei be the event of getting a 6 the i-th time the die is rolled, for 1 ≤ i ≤ n.
- Then E1, . . . , En are mutually independent.
Bayes’ Theorem (first form)
If P(A) > 0, then

  P(B|A) = (P(A|B) × P(B)) / P(A)

Proof. We have
- P(A ∩ B) = P(A|B) × P(B) and
- P(A ∩ B) = P(B|A) × P(A).

Thus,

  P(A|B) × P(B) = P(B|A) × P(A)

By dividing by P(A) we get

  P(B|A) = (P(A|B) × P(B)) / P(A)
Application: Diagnosis
Assume a patient walks into a doctor’s office complaining of a stiff neck. The doctor knows:
- meningitis may cause a patient to have a stiff neck 50% of the time (causal knowledge);
- the probability of having meningitis is 1/50000;
- the probability of having a stiff neck is 1/20.

Problem: What is the probability that the patient has meningitis?

Let A be the event that the patient has a stiff neck and B the event that he has meningitis. Then

  P(B | A) = P(A | B)P(B) / P(A) = (1/2 × 1/50000) / (1/20) = 1/5000
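
A sketch reproducing this calculation (the variable names are my own, not from the slides):

  # Sketch: Bayes' theorem (first form) with the slide's numbers.
  from fractions import Fraction

  p_stiff_given_men = Fraction(1, 2)     # P(A | B): causal knowledge
  p_men = Fraction(1, 50000)             # P(B): prior probability of meningitis
  p_stiff = Fraction(1, 20)              # P(A): prior probability of a stiff neck

  print(p_stiff_given_men * p_men / p_stiff)   # P(B | A) = 1/5000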
Comments
- We can interpret the fact that the patient has a stiff neck as a new observation.
- Given this observation, we want to classify the patient as either having meningitis or not having meningitis.
- We have prior knowledge about the unconditional probabilities of having a stiff neck and of having meningitis.
- We have causal knowledge about the proportion of cases in which meningitis causes a stiff neck.
- We can then compute the diagnostic probability using

    P(B | A) = P(A | B)P(B) / P(A)
Bayes’ Theorem (alternative form)
If P(A) > 0, then

  P(B|A) = (P(A|B) × P(B)) / (P(A|B) × P(B) + P(A|¬B) × P(¬B))

It suffices to show

  P(A) = P(A|B) × P(B) + P(A|¬B) × P(¬B)

But this follows from

  P(A) = P((A ∩ B) ∪ (A ∩ ¬B))
       = P(A ∩ B) + P(A ∩ ¬B)
       = P(A|B) × P(B) + P(A|¬B) × P(¬B)
Application: Diagnosis
Assume a drug test is
- positive for users 99% of the time;
- negative for non-users 99% of the time.

Assume that 0.5% of people take the drug.

Problem: What is the probability that a person whose test is positive (event A) takes the drug (event B)?

We have P(A|B) = 99/100, P(¬A|¬B) = 99/100, and P(B) = 1/200. Thus, P(A|¬B) = 1/100 and P(¬B) = 199/200.

Thus,

  P(B|A) = (P(A|B) × P(B)) / (P(A|B) × P(B) + P(A|¬B) × P(¬B)) ≈ 0.33
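
A sketch of this calculation (my own code, not from the slides), using the alternative form of Bayes’ theorem:

  # Sketch: P(user | positive test) via the alternative form of Bayes' theorem.
  p_pos_given_user = 0.99                          # P(A | B)
  p_neg_given_nonuser = 0.99                       # P(¬A | ¬B)
  p_user = 0.005                                   # P(B)

  p_pos_given_nonuser = 1 - p_neg_given_nonuser    # P(A | ¬B) = 0.01
  p_nonuser = 1 - p_user                           # P(¬B) = 0.995

  p_user_given_pos = (p_pos_given_user * p_user /
                      (p_pos_given_user * p_user + p_pos_given_nonuser * p_nonuser))
  print(round(p_user_given_pos, 4))                # ≈ 0.3322, i.e. about 0.33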