COMP9414 Uncertainty 1
Reasoning with Uncertainty
• An agent cannot always ascertain the truth of all propositions, so cannot hold only "flat out" beliefs (P or ¬P)
• Some environments themselves generate uncertainty for the agent, due to unpredictability or nondeterminism, so propositions alone inadequately model those environments
• Rational decisions for an agent require tradeoffs between the importance of goals, the likelihood of achieving them, and the cost of acting and failing to achieve them
COMP9414: Artificial Intelligence
Lecture 5a: Uncertainty
Wayne Wobcke
e-mail:w. .au
UNSW ©W. Wobcke et al. 2019–2021
COMP9414 Uncertainty 3
Problems with Logical Approach
• Consider trying to formalize a medical diagnosis system
  ∀p (Symptom(p, AbdominalPain) → Disease(p, Appendicitis))
• This rule is not correct, since patients with abdominal pain may be suffering from other diseases
  ∀p (Symptom(p, AbdominalPain) →
      Disease(p, Appendicitis) ∨ Disease(p, Ulcer) ∨ Disease(p, Indig) ∨ ···)
• How about a causal rule?
  ∀p (Disease(p, Ulcer) → Symptom(p, AbdominalPain))
COMP9414 Uncertainty 2
This Lecture
• Uncertainty
• Probability Theory
• Conditional Probability and Bayes' Rule
• Bayesian Networks
  ◮ Semantics of Bayesian Networks
  ◮ Inference in Bayesian Networks
COMP9414 Uncertainty 5
What Do the Numbers Mean?
Statistical/Frequentist View
  Long-run frequency of a set of "events", e.g. the probability of the event of "heads" appearing on the toss of a coin is the long-run frequency with which heads appears over many tosses
Objective View
  Probabilities are real aspects of the world – objective
Personal/Subjective/Bayesian View
  Measure of belief in a proposition based on the agent's knowledge, e.g. the probability of heads is a degree of belief that the coin will land heads, based on beliefs about the coin, or could be just a guess; different agents may assign different probabilities – subjective
COMP9414 Uncertainty 4
Sources of Uncertainty
• Difficulties arise with the logical approach due to
  incompleteness: agent may not have a complete theory for the domain
  ignorance: agent may not have enough information about the domain
  noise: information the agent does have may be unreliable
  nondeterminism: environment itself may be stochastic
  unpredictability: environment may be inherently unpredictable
• Probability gives a way of summarizing this uncertainty
  ◮ e.g. agent believes there is a probability of 0.75 that the patient suffers from appendicitis if they have abdominal pain
COMP9414 Uncertainty 7
Sample Space and Events
[Figure: the eight outcomes TTT, TTH, THT, THH, HTT, HTH, HHT, HHH drawn as points of the sample space, with one singleton outcome marked as a simple event and a larger subset marked as an event]
COMP9414 Uncertainty 6
Sample Space and Events
• Flip a coin three times
• The possible outcomes are
  TTT TTH THT THH
  HTT HTH HHT HHH
• Set of all possible outcomes
  S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}
• Any subset of the sample space is known as an event
• Any singleton subset of the sample space is known as a simple event
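As a small aside (not from the original slides), the sample space and these events can be generated directly in Python; the names here are illustrative:

    from itertools import product

    # Sample space: all 8 outcomes of three coin flips
    sample_space = {''.join(flips) for flips in product('HT', repeat=3)}

    # An event is any subset, e.g. "the first flip is heads"
    first_is_heads = {outcome for outcome in sample_space if outcome[0] == 'H'}

    # A simple event is a singleton subset
    simple_event = {'HTH'}

    print(sorted(sample_space))    # 8 outcomes
    print(sorted(first_is_heads))  # 4 outcomes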
COMP9414 Uncertainty 9
Random Variables
• Propositions are random variables that can take on several values
  P(Weather = Sunny) = 0.8
  P(Weather = Rain) = 0.1
  P(Weather = Cloudy) = 0.09
  P(Weather = Snow) = 0.01
• Every random variable X has a domain of possible values ⟨x1, x2, ···, xn⟩
• Probabilities of all possible values P(Weather) = ⟨0.8, 0.1, 0.09, 0.01⟩ is a probability distribution
• P(Weather, Appendicitis) is a combination of random variables represented by the cross product (can also use logical connectives P(A∧B) to represent compound events)
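A distribution such as P(Weather) maps each value in the domain to a probability, and the probabilities sum to 1; a minimal sketch (variable name illustrative):

    P_weather = {'Sunny': 0.8, 'Rain': 0.1, 'Cloudy': 0.09, 'Snow': 0.01}

    # A probability distribution assigns a value to every element of the
    # domain, and the values sum to 1
    assert abs(sum(P_weather.values()) - 1.0) < 1e-9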
COMP9414 Uncertainty 8
Prior Probability
• P(A) is the prior or unconditional probability that an event A occurs
• For example, P(Appendicitis) = 0.3
• In the absence of any other information, the agent believes there is a probability of 0.3 (30%) that the patient suffers from appendicitis
• To account for the effect of new information, the agent must reason with conditional probabilities to update its probabilities
COMP9414 Uncertainty 11
Conditional Probability
• Need to update probabilities based on new information
• Use conditional or posterior probability: P(A|B) is the probability of A given that all we know is B
  ◮ e.g. P(Appendicitis|AbdominalPain) = 0.75
• Definition: P(A|B) = P(A∧B)/P(B), provided P(B) > 0
• Product Rule: P(A∧B) = P(A|B).P(B)
• P(X|Y) = P(X = xi|Y = yj) for all i, j
  P(X,Y) = P(X|Y).P(Y) – a set of equations
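A sketch of the definition in Python, computing P(A|B) from a joint distribution stored as a dict; the function name and representation are illustrative, not from the lecture:

    def conditional(joint, event_a, event_b):
        """P(A|B) = P(A∧B)/P(B); events are predicates over outcomes."""
        p_b = sum(p for outcome, p in joint.items() if event_b(outcome))
        p_ab = sum(p for outcome, p in joint.items()
                   if event_a(outcome) and event_b(outcome))
        if p_b == 0:
            raise ValueError("P(B) = 0: conditional probability undefined")
        return p_ab / p_b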
COMP9414 Uncertainty 10
Axioms of Probability
1. 0 ≤ P(A) ≤ 1
   • All probabilities are between 0 and 1
2. P(True) = 1, P(False) = 0
   • Valid propositions have probability 1
   • Unsatisfiable propositions have probability 0
3. P(A∨B) = P(A) + P(B) − P(A∧B)
   • Can determine probabilities of all other propositions
   • For example, P(A∨¬A) = P(A) + P(¬A) − P(A∧¬A)
     P(True) = P(A) + P(¬A) − P(False)
     1 = P(A) + P(¬A) − 0
     Therefore P(¬A) = 1 − P(A)
COMP9414 Uncertainty 13
Joint Probability Distribution
• Complete specification of probabilities of all events in the domain
• Suppose random variables X1, X2, ···, Xn
• An atomic (simple) event is an assignment of values to all variables
• Joint probability distribution P(X1, X2, ···, Xn) assigns probabilities to all possible atomic events
• Simple medical domain with two Boolean random variables

                    AbdominalPain   ¬AbdominalPain
    Appendicitis        0.04             0.06
    ¬Appendicitis       0.01             0.89
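This table can be stored directly as a dict keyed by atomic events; a minimal sketch (layout and names illustrative):

    # Joint distribution over (Appendicitis, AbdominalPain)
    joint = {(True, True): 0.04, (True, False): 0.06,
             (False, True): 0.01, (False, False): 0.89}

    # Marginal: sum the atomic events where Appendicitis is true
    p_appendicitis = sum(p for (app, _), p in joint.items() if app)  # 0.10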
COMP9414 Uncertainty 12
Normalization
[Figure: the eight three-flip outcomes again; conditioning on the first coin landing heads gives every outcome beginning with T probability 0, while the probabilities of the four outcomes beginning with H (which total 0.45) are renormalized by dividing by 0.45, e.g. 0.1/0.45 = 2/9 and 0.15/0.45 = 3/9]
• Conditional probability distribution given that first coin is H
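A sketch of the renormalization step; the weights assigned to particular outcomes below are illustrative, as the figure does not fix the exact mapping:

    def normalise(weights):
        """Rescale nonnegative weights so that they sum to 1."""
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    # Unnormalized probabilities of the outcomes consistent with "first flip H"
    posterior = normalise({'HTT': 0.1, 'HTH': 0.15, 'HHT': 0.1, 'HHH': 0.1})
    # e.g. posterior['HTH'] == 0.15/0.45 ≈ 1/3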
COMP9414 Uncertainty 15
Joint Probability Distribution
Assume there is some underlying joint probability distribution over three
random variables toothache, cavity and catch, which we can write in the
form of a table
                toothache             ¬toothache
              catch   ¬catch        catch   ¬catch
    cavity    0.108   0.012         0.072   0.008
    ¬cavity   0.016   0.064         0.144   0.576
Note that the sum of the entries in the table is 1.0
For any proposition, sum the atomic events where it is true
COMP9414 Uncertainty 14
Joint Probability Distribution
• Simple events are mutually exclusive and jointly exhaustive
• Probability of a complex event is the sum of the probabilities of the compatible simple events
  P(Appendicitis) = 0.04 + 0.06 = 0.10
  P(Appendicitis∨AbdominalPain) = 0.04 + 0.06 + 0.01 = 0.11
  P(Appendicitis|AbdominalPain) = P(Appendicitis∧AbdominalPain)/P(AbdominalPain)
    = 0.04/(0.04 + 0.01) = 0.8
• Problem: with many variables, the number of probabilities is vast
COMP9414 Uncertainty 17
Inference by Enumeration Example
                toothache             ¬toothache
              catch   ¬catch        catch   ¬catch
    cavity    0.108   0.012         0.072   0.008
    ¬cavity   0.016   0.064         0.144   0.576
For any proposition, sum the atomic events where it is true
P(cavity∨toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
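Inference by enumeration is a direct sum over the table; a minimal sketch (dict layout and helper name are illustrative):

    # Joint distribution over (cavity, toothache, catch), from the table
    joint = {(True, True, True): 0.108,  (True, True, False): 0.012,
             (True, False, True): 0.072, (True, False, False): 0.008,
             (False, True, True): 0.016, (False, True, False): 0.064,
             (False, False, True): 0.144, (False, False, False): 0.576}

    def prob(event):
        """Sum the atomic events where the proposition holds."""
        return sum(p for outcome, p in joint.items() if event(outcome))

    print(prob(lambda o: o[0] or o[1]))  # P(cavity ∨ toothache) = 0.28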
COMP9414 Uncertainty 16
Inference by Enumeration Example
                toothache             ¬toothache
              catch   ¬catch        catch   ¬catch
    cavity    0.108   0.012         0.072   0.008
    ¬cavity   0.016   0.064         0.144   0.576
For any proposition, sum the atomic events where it is true
P(toothache) = 0.108+0.012+0.016+0.064 = 0.2
COMP9414 Uncertainty 19
Bayes’ Rule
P(B|A) = P(A|B).P(B)/P(A)

• AI systems abandon joint probabilities and work directly with conditional probabilities using Bayes' Rule
• Deriving Bayes' Rule:
  P(A∧B) = P(A|B).P(B) (Definition)
  P(B∧A) = P(B|A).P(A) (Definition)
  So P(A|B).P(B) = P(B|A).P(A), since P(A∧B) = P(B∧A)
  Hence P(B|A) = P(A|B).P(B)/P(A) if P(A) ≠ 0
• Note: If P(A) = 0, P(B|A) is undefined
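The rule transcribes to a one-liner; a hedged sketch (function name illustrative):

    def bayes(p_a_given_b, p_b, p_a):
        """P(B|A) = P(A|B).P(B)/P(A); undefined when P(A) = 0."""
        if p_a == 0:
            raise ValueError("P(A) = 0: P(B|A) is undefined")
        return p_a_given_b * p_b / p_a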
COMP9414 Uncertainty 18
Conditional Probability by Enumeration
                toothache             ¬toothache
              catch   ¬catch        catch   ¬catch
    cavity    0.108   0.012         0.072   0.008
    ¬cavity   0.016   0.064         0.144   0.576

P(¬cavity|toothache) = P(¬cavity∧toothache)/P(toothache)
  = (0.016 + 0.064)/(0.108 + 0.012 + 0.016 + 0.064) = 0.4
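Using the prob helper sketched earlier, the same number falls out of the table:

    # P(¬cavity | toothache) via enumeration, reusing prob() from above
    print(prob(lambda o: not o[0] and o[1]) / prob(lambda o: o[1]))  # 0.4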
COMP9414 Uncertainty 21
Normalization
• Avoiding assessment of symptoms
  P(Meningitis|StiffNeck) = P(StiffNeck|Meningitis).P(Meningitis)/P(StiffNeck)
  P(¬Meningitis|StiffNeck) = P(StiffNeck|¬Meningitis).P(¬Meningitis)/P(StiffNeck)
• P(StiffNeck) = P(StiffNeck|Meningitis).P(Meningitis) + P(StiffNeck|¬Meningitis).P(¬Meningitis)
• So P(Meningitis|StiffNeck) =
  P(StiffNeck|Meningitis).P(Meningitis) / [P(StiffNeck|Meningitis).P(Meningitis) + P(StiffNeck|¬Meningitis).P(¬Meningitis)]
  and similarly for P(¬Meningitis|StiffNeck)
• Therefore the two likelihoods P(StiffNeck|···) together with the prior determine P(StiffNeck), and the denominator acts as a normalization factor
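A sketch of this normalization pattern; the function name, and the assumption that both likelihoods and the prior are known, are choices of the sketch rather than the lecture:

    def posterior(p_e_given_c, p_e_given_not_c, prior):
        """P(Cause|Effect) by normalization: P(Effect) never assessed directly."""
        joint_pos = p_e_given_c * prior            # P(E|C).P(C)
        joint_neg = p_e_given_not_c * (1 - prior)  # P(E|¬C).P(¬C)
        return joint_pos / (joint_pos + joint_neg)

Given an estimate of P(StiffNeck|¬Meningitis), posterior(0.5, ..., 1/50000) would reproduce the meningitis example on the next slide without a direct estimate of P(StiffNeck).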
COMP9414 Uncertainty 20
Applying Bayes’ Rule
• Example (Russell & Norvig, 1995)
• Doctor knows that
  – meningitis causes a stiff neck 50% of the time
  – chance of a patient having meningitis is 1/50000
  – chance of a patient having a stiff neck is 1/20
• P(StiffNeck|Meningitis) = 0.5
  P(Meningitis) = 1/50000
  P(StiffNeck) = 1/20
  P(Meningitis|StiffNeck) = P(StiffNeck|Meningitis).P(Meningitis)/P(StiffNeck)
    = (0.5 × 1/50000)/(1/20) = 0.0002
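A direct transcription of this calculation (illustrative sketch):

    p_stiff_given_men = 0.5
    p_men = 1 / 50000
    p_stiff = 1 / 20

    print(p_stiff_given_men * p_men / p_stiff)  # P(Meningitis|StiffNeck) = 0.0002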
COMP9414 Uncertainty 23
Conditional Independence
• Appendicitis is a direct cause of both abdominal pain and nausea
• If we know a patient is suffering from appendicitis, the probability of nausea should not depend on the presence of abdominal pain; likewise the probability of abdominal pain should not depend on nausea
• Nausea and abdominal pain are conditionally independent given appendicitis
• An event X is independent of event Y, conditional on background knowledge K, if knowing Y does not affect the conditional probability of X given K:
  P(X|K) = P(X|Y,K)
COMP9414 Uncertainty 22
Using Bayes’ Rule
• Suppose there are two conditional probabilities for appendicitis
  P(Appendicitis|AbdominalPain) = 0.8
  P(Appendicitis|Nausea) = 0.1
• P(Appendicitis|AbdominalPain∧Nausea)
  = P(AbdominalPain∧Nausea|Appendicitis).P(Appendicitis)/P(AbdominalPain∧Nausea)
• Need to know P(AbdominalPain∧Nausea|Appendicitis)
• With many symptoms that is a daunting task …
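This is where conditional independence (previous slide) helps; under that assumption the awkward term factorizes:

  P(AbdominalPain∧Nausea|Appendicitis) = P(AbdominalPain|Appendicitis).P(Nausea|Appendicitis)

so only per-symptom conditionals need be assessed.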
COMP9414 Uncertainty 25
Bayesian Networks
• Example (Pearl, 1988)
• You have a new burglar alarm at home that is quite reliable at detecting burglars but may also respond at times to an earthquake. You also have two neighbours, John and Mary, who promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then too; Mary likes loud music and sometimes misses the alarm. Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.
COMP9414 Uncertainty 24
Bayesian Networks
• A Bayesian network (also Bayesian belief network, probabilistic network, causal network, knowledge map) is a directed acyclic graph (DAG) where
  ◮ Each node corresponds to a random variable
  ◮ Directed links connect pairs of nodes – a directed link from node X to node Y means that X has a direct influence on Y
  ◮ Each node has a conditional probability table quantifying the effect of its parents on the node
• Independence assumption of Bayesian networks
  ◮ Each random variable is (conditionally) independent of its nondescendants given its parents
COMP9414 Uncertainty 27
Conditional Probability Table
• Each row contains the conditional probability of each node value for a conditioning case (a possible combination of values for the parent nodes)

  P(Alarm|Burglary∧Earthquake)
    Burglary   Earthquake   True    False
    True       True         0.950   0.050
    True       False        0.940   0.060
    False      True         0.290   0.710
    False      False        0.001   0.999
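Since each row sums to 1, only P(Alarm = True) need be stored; a sketch of the table as a dict (layout illustrative):

    # CPT: (burglary, earthquake) -> P(Alarm = True)
    p_alarm_true = {(True, True): 0.95, (True, False): 0.94,
                    (False, True): 0.29, (False, False): 0.001}

    def p_alarm(alarm, burglary, earthquake):
        """Look up P(Alarm = alarm | Burglary, Earthquake)."""
        pt = p_alarm_true[(burglary, earthquake)]
        return pt if alarm else 1 - pt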
COMP9414 Uncertainty 26
Bayesian Networks
• Example (Pearl, 1988)

[Figure: Bayesian network with links Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls and Alarm → MaryCalls, annotated with the following probabilities]

  P(Burglary) = 0.001     P(Earthquake) = 0.002

    Burglary   Earthquake   P(Alarm)
    True       True         0.95
    True       False        0.94
    False      True         0.29
    False      False        0.001

    Alarm   P(JohnCalls)        Alarm   P(MaryCalls)
    True    0.90                True    0.70
    False   0.05                False   0.01

• Probabilities summarize a potentially infinite set of possible circumstances
COMP9414 Uncertainty 29
Semantics of Bayesian Networks
• Factorization of joint probability distribution
• Chain Rule: use conditional probabilities to decompose conjunctions
  P(X1∧X2∧···∧Xn) = P(X1).P(X2|X1).P(X3|X1∧X2). ··· .P(Xn|X1∧X2∧···∧Xn−1)
• Now order the variables X1, X2, ···, Xn in a Bayesian network so that a variable comes after its parents – let πXi be the tuple of parents of variable Xi (this is a complex random variable)
  Using the chain rule, P(X1∧X2∧···∧Xn) = P(X1).P(X2|X1).P(X3|X1∧X2). ··· .P(Xn|X1∧X2∧···∧Xn−1)
COMP9414 Uncertainty 28
Semantics of Bayesian Networks
• Bayesian network provides a complete description of the domain
• Joint probability distribution can be determined from the network
  ◮ P(X1, X2, ···, Xn) = ∏_{i=1}^{n} P(Xi|Parents(Xi))
• For example, P(J∧M∧A∧¬B∧¬E)
  = P(J|A).P(M|A).P(A|¬B∧¬E).P(¬B).P(¬E)
  = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 = 0.000628
• Bayesian network is a complete and non-redundant representation of the domain (and can be far more compact than the joint probability distribution)
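A sketch of this factorization for the burglary network, with CPT values taken from the earlier slides (variable names are illustrative):

    # Priors and CPTs for the burglary network
    p_b, p_e = 0.001, 0.002
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}  # P(A=True | B, E)
    p_j = {True: 0.90, False: 0.05}  # P(J=True | A)
    p_m = {True: 0.70, False: 0.01}  # P(M=True | A)

    def joint(b, e, a, j, m):
        """P(B=b ∧ E=e ∧ A=a ∧ J=j ∧ M=m) as a product of network factors."""
        pb = p_b if b else 1 - p_b
        pe = p_e if e else 1 - p_e
        pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
        pj = p_j[a] if j else 1 - p_j[a]
        pm = p_m[a] if m else 1 - p_m[a]
        return pb * pe * pa * pj * pm

    print(joint(False, False, True, True, True))  # ≈ 0.000628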
COMP9414 Uncertainty 31
Calculation using Bayesian Networks
• Fact 1: Consider random variable X with parents Y1, Y2, ···, Yn
  P(X|Y1∧···∧Yn∧Z) = P(X|Y1∧···∧Yn)
  if Z doesn't involve a descendant of X (including X itself)
• Fact 2: If Y1, ···, Yn are pairwise disjoint and exhaust all possibilities
  P(X) = Σi P(X∧Yi) = Σi P(X|Yi).P(Yi)
  P(X|Z) = Σi P(X∧Yi|Z)
  ◮ e.g. P(J|B) = P(J∧B)/P(B) = Σ P(J∧B∧e∧a∧m) / Σ P(j∧B∧e∧a∧m),
    where e ranges over E,¬E, a over A,¬A and m over M,¬M in both sums, and j ranges over J,¬J in the denominator only
COMP9414 Uncertainty 30
Semantics of Bayesian Networks
• Each P(Xi|X1∧X2∧···∧Xi−1) has the property that it is not conditioned on a descendant of Xi (given the ordering of variables in the Bayesian network)
• Therefore, by conditional independence (Fact 1)
  ◮ P(Xi|X1∧X2∧···∧Xi−1) = P(Xi|πXi)
• Rewriting the chain rule gives the factorization
  ◮ P(X1, X2, ···, Xn) = ∏_{i=1}^{n} P(Xi|πXi)
COMP9414 Uncertainty 33
Calculation using Bayesian Networks
• P(¬J∧B∧E∧A∧M) = 0.000000133
• P(¬J∧B∧¬E∧A∧M) = 6.56684 × 10⁻⁵
• P(¬J∧B∧E∧¬A∧M) = 9.5 × 10⁻¹⁰
• P(¬J∧B∧¬E∧¬A∧M) = 5.6886 × 10⁻⁷
• P(¬J∧B∧E∧A∧¬M) = 0.000000057
• P(¬J∧B∧¬E∧A∧¬M) = 2.81436 × 10⁻⁵
• P(¬J∧B∧E∧¬A∧¬M) = 9.405 × 10⁻⁸
• P(¬J∧B∧¬E∧¬A∧¬M) = 5.63171 × 10⁻⁵
COMP9414 Uncertainty 32
Calculation using Bayesian Networks
• P(J∧B∧E∧A∧M) = P(J|A).P(B).P(E).P(A|B∧E).P(M|A)
  = 0.90 × 0.001 × 0.002 × 0.95 × 0.70 = 0.000001197
• P(J∧B∧¬E∧A∧M) = 0.0005910156
• P(J∧B∧E∧¬A∧M) = 5 × 10⁻¹¹
• P(J∧B∧¬E∧¬A∧M) = 2.99 × 10⁻⁸
• P(J∧B∧E∧A∧¬M) = 0.000000513
• P(J∧B∧¬E∧A∧¬M) = 0.000253292
• P(J∧B∧E∧¬A∧¬M) = 4.95 × 10⁻⁹
• P(J∧B∧¬E∧¬A∧¬M) = 2.96406 × 10⁻⁶
COMP9414 Uncertainty 35
Inference in Bayesian Networks
Diagnostic Inference: From effects to causes
  P(Burglary|JohnCalls) = 0.016
Causal Inference: From causes to effects
  P(JohnCalls|Burglary) = 0.85; P(MaryCalls|Burglary) = 0.66
Intercausal Inference: Explaining away
  P(Burglary|Alarm) = 0.3736, but adding evidence, P(Burglary|Alarm∧Earthquake) = 0.003; although burglaries and earthquakes are unconditionally independent, given the alarm, evidence of an earthquake makes a burglary much less likely – the earthquake "explains away" the alarm
Mixed Inference: Combinations of the patterns above
  Diagnostic + Causal: P(Alarm|JohnCalls∧¬Earthquake)
  Intercausal + Diagnostic: P(Burglary|JohnCalls∧¬Earthquake)
COMP9414 Uncertainty 34
Calculation using Bayesian Networks
• Therefore P(J|B) = P(J∧B)/P(B) = Σ P(J∧B∧e∧a∧m) / Σ P(j∧B∧e∧a∧m)
  = 0.000849017/0.001
• P(J|B) = 0.849017
• Can often simplify the calculation without using full joint probabilities – but not always
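Continuing the joint() sketch from the "Semantics of Bayesian Networks" slide, the same number by brute-force enumeration:

    from itertools import product

    # P(J|B): sum atomic events for the numerator and denominator
    num = sum(joint(True, e, a, True, m)
              for e, a, m in product([True, False], repeat=3))
    den = sum(joint(True, e, a, j, m)
              for e, a, j, m in product([True, False], repeat=4))
    print(num / den)  # 0.849017...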
COMP9414 Uncertainty 37
Example – Causal Inference
• P(JohnCalls|Burglary)
• P(J|B) = P(J|A∧B).P(A|B) + P(J|¬A∧B).P(¬A|B)
  = P(J|A).P(A|B) + P(J|¬A).P(¬A|B)
  = P(J|A).P(A|B) + P(J|¬A).(1 − P(A|B))
• Now P(A|B) = P(A|B∧E).P(E|B) + P(A|B∧¬E).P(¬E|B)
  = P(A|B∧E).P(E) + P(A|B∧¬E).P(¬E)
  = 0.95 × 0.002 + 0.94 × 0.998 = 0.94002
• Therefore P(J|B) = 0.90 × 0.94002 + 0.05 × 0.05998 = 0.849017
• Fact 3: P(X|Z) = P(X|Y∧Z).P(Y|Z) + P(X|¬Y∧Z).P(¬Y|Z), since
  X∧Z ≡ (X∧Y∧Z) ∨ (X∧¬Y∧Z) (conditional version of Fact 2)
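The two-step calculation transcribes directly (illustrative sketch):

    # P(A|B): condition on Earthquake, which is independent of Burglary
    p_a_given_b = 0.95 * 0.002 + 0.94 * 0.998            # 0.94002
    # P(J|B): condition on Alarm (Fact 3)
    p_j_given_b = 0.90 * p_a_given_b + 0.05 * (1 - p_a_given_b)
    print(p_j_given_b)  # 0.849017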
COMP9414 Uncertainty 36
Inference in Bayesian Networks
[Figure: four small networks showing the position of the query node Q relative to the evidence node(s) E for diagnostic, causal, intercausal and mixed inference]
• Q = query; E = evidence
COMP9414 Uncertainty 39
Conclusion
• Due to noise or uncertainty it is useful to reason with probabilities
• Calculating with the joint probability distribution is difficult due to the large number of values
• Use of Bayes' Rule and independence assumptions simplifies reasoning
• Bayesian networks allow compact representation of probabilities and efficient reasoning with probabilities
• Elegant recursive algorithms can be given to automate the process of inference in Bayesian networks
COMP9414 Uncertainty 38
Example – Diagnostic Inference
• P(Earthquake|Alarm)
• P(E|A) = P(A|E).P(E)/P(A)
  = [P(A|B∧E).P(B).P(E) + P(A|¬B∧E).P(¬B).P(E)] / P(A)
  = (0.95 × 0.001 × 0.002 + 0.29 × 0.999 × 0.002) / P(A)
  = 5.8132 × 10⁻⁴ / P(A)
• Now P(A) = P(A|B∧E).P(B).P(E) + P(A|¬B∧E).P(¬B).P(E) + P(A|B∧¬E).P(B).P(¬E) + P(A|¬B∧¬E).P(¬B).P(¬E)
  And P(A|B∧¬E).P(B).P(¬E) + P(A|¬B∧¬E).P(¬B).P(¬E)
  = 0.94 × 0.001 × 0.998 + 0.001 × 0.999 × 0.998 = 0.001935122
  So P(A) = 5.8132 × 10⁻⁴ + 0.001935122 = 0.002516442
• Therefore P(E|A) = 5.8132 × 10⁻⁴ / 0.002516442 = 0.2310087
• Fact 4: P(X∧Y) = P(X).P(Y) if X, Y are independent (used here for Burglary and Earthquake, e.g. P(B∧E) = P(B).P(E))
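The calculation again transcribes directly (illustrative sketch):

    # Diagnostic inference: P(E|A) = P(A∧E)/P(A)
    p_a_and_e = 0.95 * 0.001 * 0.002 + 0.29 * 0.999 * 0.002       # P(A∧E)
    p_a_and_not_e = 0.94 * 0.001 * 0.998 + 0.001 * 0.999 * 0.998  # P(A∧¬E)
    p_a = p_a_and_e + p_a_and_not_e                                # 0.002516442
    print(p_a_and_e / p_a)  # 0.2310087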