CM3112 Artificial Intelligence
Knowledge and reasoning: Bayesian networks
Steven Schockaert
SchockaertS1@cardiff.ac.uk
School of Computer Science & Informatics Cardiff University
Probabilistic inference: recap
Inference by enumeration is based on an explicit encoding of the probability of each propositional interpretation (i.e. each possible world)
‣ general: allows us to find all probabilities and conditional probabilities of interest
‣ expensive: requires an exponential number of values to be specified, and has an exponential worst-case time complexity
Bayesian inference with the Naive Bayes assumption is based on estimating the likelihood of different causes, given that particular effects are present
‣ efficient: by assuming that effects are conditionally independent, the number of values we need to store is linear in the number of effects and in the number of causes
‣ restrictive: the Naive Bayes assumption is often unrealistic
Bayesian networks try to combine the best of both worlds
Bayesian networks
Bayesian networks: example
The network has five Boolean atoms: Burglary, Earthquake, Alarm, JohnCalls and MaryCalls. Burglary and Earthquake are the parents of Alarm, and Alarm is the parent of both JohnCalls and MaryCalls.

P(B) = .001        P(E) = .002

B  E  P(A|B,E)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

A  P(J|A)
T  .90
F  .05

A  P(M|A)
T  .70
F  .01

The probability that there's a burglary is independent of whether or not there's an earthquake.
The probability that Mary calls is conditionally independent of the probability that John calls, given the knowledge of whether or not the alarm has gone off.
Knowledge about the (conditional) independence of atoms is encoded as a directed acyclic graph (DAG).
Bayesian networks: example (contd.)
There's a 0.2% chance that there's an earthquake: P(E) = .002.
For each atom, a conditional probability table (CPT) encodes the probability that the atom is true, given which of its parents in the DAG are true.
Bayesian networks: example (contd.)
If there is a burglary and an earthquake, there's a 95% chance that the alarm will go off: P(A|B,E) = .95 when B=T and E=T.
Properties
If every node has at most k parents, then we need to store O(n·2^k) values to encode a network with n nodes
Assuming the maximum number of parents k is a fixed constant, the size of a Bayesian network is thus linear in the number of atoms
In general, a Bayesian network can represent any joint probability distribution
To encode networks in a more compact way, we can let each node represent an arbitrary random variable, instead of an atomic proposition
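For instance, the burglary network from the previous slides can be encoded with just 1 + 1 + 4 + 2 + 2 = 10 numbers, instead of the 2^5 − 1 = 31 required by an explicit joint distribution. The joint probability of any complete assignment can then be recovered from the CPTs via the chain-rule factorisation of the network. A minimal Python sketch (the code and names are my own, not from the lecture):

```python
# CPT entries for "atom is true" in the burglary network
p_b = 0.001                                          # P(Burglary)
p_e = 0.002                                          # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
p_j = {True: 0.90, False: 0.05}                      # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}                      # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m), factorised as
    P(b) * P(e) * P(a|b,e) * P(j|a) * P(m|a)."""
    def pr(p, v):          # probability that a Boolean atom takes value v
        return p if v else 1 - p
    return (pr(p_b, b) * pr(p_e, e) * pr(p_a[(b, e)], a)
            * pr(p_j[a], j) * pr(p_m[a], m))

# e.g. P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = 0.999 * 0.998 * 0.001 * 0.90 * 0.70
print(joint(False, False, True, True, True))         # ≈ 0.000628
```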
Bayesian networks: example
The node Degree (degree classification, with values first, 2.1, 2.2, third and fail) is the parent of three Boolean atoms: "likes AI", "likes probability" and "likes programming".

Degree   P(Degree)   P(AI|Degree)   P(Prob|Degree)   P(Prog|Degree)
first    0.2         0.7            0.6              0.9
2.1      0.3         0.6            0.4              0.8
2.2      0.3         0.4            0.2              0.7
third    0.1         0.3            0.1              0.6
fail     0.1         0.2            0.1              0.5
Conditional independence
Previously we have defined conditional independence for events
This definition naturally leads to a notion of conditional independence for random variables:
We say that random variables X and Y are conditionally independent given the random variables Z1,…,Zn if the events X=a and Y=b are conditionally independent given the events Z1=c1,…,Zn=cn, for all values a, b, c1, …, cn in the domains of these random variables
Conditional independence assumptions
Consider a Bayesian network with nodes corresponding to the random variables X1,…,Xn
The network structure of a Bayesian network encodes the following conditional independence assumption:
A random variable X is conditionally independent of its non-descendants, given the random variables in parents(X)
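This assumption can be checked numerically. The sketch below (my own code, not from the lecture) verifies by brute-force enumeration of the 2^5 possible worlds of the burglary network that P(JohnCalls | Alarm) is unchanged when we additionally condition on Burglary, a non-descendant of JohnCalls:

```python
from itertools import product

# P(Alarm | B, E), from the burglary example
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a, j, m):
    """Joint probability P(B=b, E=e, A=a, J=j, M=m) from the CPTs."""
    def pr(p, v):                  # P(atom = v), given P(atom = true) = p
        return p if v else 1 - p
    return (pr(0.001, b) * pr(0.002, e) * pr(p_a[(b, e)], a)
            * pr(0.90 if a else 0.05, j) * pr(0.70 if a else 0.01, m))

def prob(query, given):
    """P(query | given), query/given as dicts over 'b','e','a','j','m'."""
    names = ["b", "e", "a", "j", "m"]
    num = den = 0.0
    for vals in product([True, False], repeat=5):
        world = dict(zip(names, vals))
        if all(world[k] == v for k, v in given.items()):
            den += joint(*vals)
            if all(world[k] == v for k, v in query.items()):
                num += joint(*vals)
    return num / den

# JohnCalls is conditionally independent of Burglary given Alarm:
print(prob({"j": True}, {"a": True}))              # 0.9
print(prob({"j": True}, {"a": True, "b": True}))   # 0.9
```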
Example
Given the value of Alarm, JohnCalls is conditionally independent of Burglary, Earthquake and MaryCalls.
Given the value of Alarm, MaryCalls is conditionally independent of Burglary, Earthquake and JohnCalls.
Burglary is conditionally independent of Earthquake, and Earthquake is conditionally independent of Burglary: neither is a descendant of the other, and neither has any parents.
Chapter 14.1–3
Example contd.
The conditional independence assumption from the previous slide can be written as follows. For every node X:

P(X=x | {Yi=yi | Yi is not a descendant of X}) = P(X=x | {Yi=yi | Yi is a parent of X})

For instance, in the burglary network:

P(johnCalls | ¬alarm, earthquake, ¬maryCalls) = P(johnCalls | ¬alarm)
P(¬maryCalls | ¬alarm, ¬earthquake, johnCalls) = P(¬maryCalls | ¬alarm)
P(burglary | ¬earthquake) = P(burglary)
P(earthquake | burglary) = P(earthquake)

More generally, every node is conditionally independent of all other nodes in the network, given its Markov blanket: its parents, its children, and the other parents of its children:

P(X=x | {Yi=yi | Yi is any other node in the network}) = P(X=x | {Yi=yi | Yi is in the Markov blanket of X})

We write alarm as an abbreviation for Alarm=true and ¬alarm as an abbreviation for Alarm=false, and similarly for the other atoms.
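The equalities above can be verified by brute-force enumeration. A self-contained sketch (my own code, using the CPT values from the burglary example):

```python
from itertools import product

# P(Alarm | B, E), from the burglary example
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a, j, m):
    """Joint probability of one possible world, via the CPT factorisation."""
    def pr(p, v):                  # P(atom = v), given P(atom = true) = p
        return p if v else 1 - p
    return (pr(0.001, b) * pr(0.002, e) * pr(p_a[(b, e)], a)
            * pr(0.90 if a else 0.05, j) * pr(0.70 if a else 0.01, m))

def cond(query, given):
    """P(query | given) by summing over all 2^5 worlds."""
    names = ["b", "e", "a", "j", "m"]
    num = den = 0.0
    for vals in product([True, False], repeat=5):
        world = dict(zip(names, vals))
        if all(world[k] == v for k, v in given.items()):
            den += joint(*vals)
            if all(world[k] == v for k, v in query.items()):
                num += joint(*vals)
    return num / den

# P(johnCalls | ¬alarm, earthquake, ¬maryCalls) = P(johnCalls | ¬alarm) = .05
print(cond({"j": True}, {"a": False, "e": True, "m": False}))
print(cond({"j": True}, {"a": False}))
# P(burglary | ¬earthquake) = P(burglary) = .001
print(cond({"b": True}, {"e": False}))
```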