L14 – Inference in Bayes Nets
EECS 391 Intro to AI
Thu Oct 25
Recap: Modeling causal relationships with belief networks
• Direct cause: A → B, specified by P(B|A)
• Indirect cause: A → B → C, specified by P(B|A) and P(C|B)
• Common cause: A → B and A → C, specified by P(B|A) and P(C|A)
• Common effect: A → C and B → C, specified by P(C|A,B)
Defining the belief network
• Each link in the graph represents a conditional relationship between nodes.
• To perform inference, we must specify the conditional probabilities.
• Let’s start with the open door. What do we specify?
[Network: wife → open door, wife → car in garage; burglar → open door, burglar → damaged door]

P(W) = 0.05      P(B) = 0.001

 W  B  P(O|W,B)
 F  F  0.01
 F  T  0.25
 T  F  0.05
 T  T  0.75

 B  P(D|B)
 F  0.001
 T  0.5

 W  P(C|W)
 F  0.01
 T  0.95

Finally, we specify the remaining conditionals.
Now what?
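To make this concrete, here is one possible Python encoding of these CPTs. The numbers are the table entries above; the dictionary layout is just an assumed representation chosen for illustration.

```python
# Sketch: the burglary-network CPTs from the tables above, encoded as
# Python dictionaries. The layout (tuples of parent values as keys) is
# an assumed representation, not part of the slides.

P_W = 0.05    # P(wife home)
P_B = 0.001   # P(burglar)

# P(open door | wife, burglar)
P_O = {(False, False): 0.01, (False, True): 0.25,
       (True, False): 0.05,  (True, True): 0.75}

# P(damaged door | burglar)
P_D = {False: 0.001, True: 0.5}

# P(car in garage | wife)
P_C = {False: 0.01, True: 0.95}

# Example lookup: probability the door is open given wife home, no burglar
print(P_O[(True, False)])   # 0.05
```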
Calculating probabilities using the joint distribution
• What is the probability that the door is open, it is my wife and not a burglar, we see the
car in the garage, and the door is not damaged?
• Mathematically, we want to compute the expression: P(o,w,¬b,c,¬d) = ?
• We can just repeatedly apply the rule relating joint and conditional probabilities.
– P(x,y) = P(x|y) P(y)
Summary of inference with the joint probability distribution
• The complete (probabilistic) relationship between variables is specified by the
joint probability:
• All conditional and marginal distributions can be derived from this using the
basic rules of probability, the sum rule and the product rule
P(x1, x2, …, xn) = P(X1 = x1, X2 = x2, …, Xn = xn)

P(X) = Σ_Y P(X,Y)                          sum rule, “marginalization”
P(X,Y) = P(Y|X) P(X) = P(X|Y) P(Y)         product rule
P(Y|X) = P(X,Y) / P(X)                     corollary: conditional probability
P(Y|X) = P(X|Y) P(Y) / P(X)                corollary: Bayes rule
Calculating probabilities using the joint distribution
• The probability that the door is open, it is my wife and not a burglar, we see the car in
the garage, and the door is not damaged.
• P(o,w,¬b,c,¬d) = P(o|w,¬b,c,¬d)P(w,¬b,c,¬d)
= P(o|w,¬b)P(w,¬b,c,¬d)
= P(o|w,¬b)P(c|w,¬b,¬d)P(w,¬b,¬d)
= P(o|w,¬b)P(c|w)P(w,¬b,¬d)
= P(o|w,¬b)P(c|w)P(¬d|w,¬b)P(w,¬b)
= P(o|w,¬b)P(c|w)P(¬d|¬b)P(w,¬b)
= P(o|w,¬b)P(c|w)P(¬d|¬b)P(w)P(¬b)
Calculating probabilities using the joint distribution
• P(o,w,¬b,c,¬d) = P(o|w,¬b)P(c|w)P(¬d|¬b)P(w)P(¬b)
= 0.05 × 0.95 × 0.999 × 0.05 × 0.999 = 0.0024
• This is essentially the probability that my wife is home and leaves the door open.
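A two-line check of this arithmetic:

```python
# Product of the factors P(o|w,¬b) P(c|w) P(¬d|¬b) P(w) P(¬b)
p = 0.05 * 0.95 * 0.999 * 0.05 * 0.999
print(round(p, 4))   # 0.0024
```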
Calculating probabilities in a general Bayesian belief network
[Graph: A → C, A → E, C → B, C → E, C → D, E → D]
• Note that by specifying all the conditional probabilities, we have also specified the joint
probability. For the directed graph above:
P(A,B,C,D,E) = P(A) P(B|C) P(C|A) P(D|C,E) P(E|A,C)
• The general expression is:
P(x1, …, xn) ≡ P(X1 = x1 ∧ … ∧ Xn = xn) = ∏_{i=1}^{n} P(xi | parents(Xi))
• With this we can calculate (in principle) any entry of the joint distribution.
• This implies that we can also calculate any conditional probability.
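As a sketch, this product can be computed directly from a table-based representation. The parents dictionary below encodes the A–E graph above; all of the numeric CPT values are hypothetical, chosen only for illustration.

```python
# Sketch: P(x1,...,xn) = prod_i P(xi | parents(Xi)) for the A–E graph above.
# The dictionary encoding and all numbers here are hypothetical illustrations.

parents = {"A": (), "C": ("A",), "B": ("C",), "E": ("A", "C"), "D": ("C", "E")}

# cpt[X][parent_values] = P(X = True | parent values)
cpt = {
    "A": {(): 0.3},
    "C": {(True,): 0.8, (False,): 0.1},
    "B": {(True,): 0.6, (False,): 0.2},
    "E": {(True, True): 0.7, (True, False): 0.4,
          (False, True): 0.5, (False, False): 0.05},
    "D": {(True, True): 0.9, (True, False): 0.3,
          (False, True): 0.6, (False, False): 0.01},
}

def joint_prob(assignment):
    """Multiply one CPT entry per node: P(xi | parents(Xi))."""
    p = 1.0
    for var, value in assignment.items():
        key = tuple(assignment[u] for u in parents[var])
        p_true = cpt[var][key]
        p *= p_true if value else 1.0 - p_true
    return p

print(joint_prob({"A": True, "B": False, "C": True, "D": False, "E": True}))
```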
For the burglar model
• The structure of this model allows a simple expression for the joint probability
P(x1, …, xn) ≡ P(X1 = x1 ∧ … ∧ Xn = xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

⇒ P(o, c, d, w, b) = P(c|w) P(o|w,b) P(d|b) P(w) P(b)
What if we want a simpler probabilistic question?
• How do we calculate P(b|o), i.e. the probability of a burglar given we see the open door?
• This is not an entry in the joint distribution. We had:
P(o,w,¬b,c,¬d) = P(o|w,¬b)P(c|w)P(¬d|¬b)P(w)P(¬b)
= 0.05 × 0.95 × 0.999 × 0.05 × 0.999 = 0.0024
Calculating conditional probabilities
• So, how do we compute P(b|o)?
• Repeatedly apply laws of probability (factorization, marginalization, etc.).
• Using the joint we can compute any conditional probability too.
• The conditional probability of any one subset of variables given another disjoint subset is
P(S1 | S2) = P(S1 ∧ S2) / P(S2) = (Σ_{p ∈ S1 ∧ S2} p) / (Σ_{p ∈ S2} p)
where p ∈ S is shorthand for all the entries of the joint matching subset S.
• How many terms are in this sum? 2^N: the number of terms in the sums is exponential in the number of variables.
• In fact, general querying of Bayes nets is NP-complete.
Variable elimination on the burglary network
• We could do straight summation:
p(b|o) = α Σ_{w,c,d} p(o, w, b, c, d) = α Σ_{w,c,d} p(o|w,b) p(c|w) p(d|b) p(w) p(b)
• But: the number of terms in the sum is exponential in the number of non-evidence variables.
• This is bad, and we can do much better.
• We start by observing that we can pull out many terms from the summation.
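As a concrete sketch of this straight summation, the following brute-force computation of p(b|o) enumerates all values of the non-evidence variables w, c, d and normalizes over b. The numbers are the CPT entries from the tables; the dictionary encoding is an assumed representation.

```python
from itertools import product

# CPTs of the burglary network (numbers from the tables above)
P_W, P_B = 0.05, 0.001
P_O = {(False, False): 0.01, (False, True): 0.25,
       (True, False): 0.05,  (True, True): 0.75}   # P(o | w, b)
P_D = {False: 0.001, True: 0.5}                     # P(d | b)
P_C = {False: 0.01, True: 0.95}                     # P(c | w)

def joint(o, c, d, w, b):
    """P(o,c,d,w,b) = P(c|w) P(o|w,b) P(d|b) P(w) P(b)."""
    return ((P_C[w] if c else 1 - P_C[w]) *
            (P_O[(w, b)] if o else 1 - P_O[(w, b)]) *
            (P_D[b] if d else 1 - P_D[b]) *
            (P_W if w else 1 - P_W) *
            (P_B if b else 1 - P_B))

# p(b | o=True): sum the joint over w, c, d for each value of b, then normalize
unnormalized = {}
for b in (True, False):
    unnormalized[b] = sum(joint(True, c, d, w, b)
                          for w, c, d in product((True, False), repeat=3))
alpha = 1.0 / sum(unnormalized.values())
print(alpha * unnormalized[True])   # P(burglar | open door), roughly 0.022
```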
Variable elimination
• When we’ve pulled out all the redundant terms we get:
p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w,b) Σ_c p(c|w)
• We can also note that the last term sums to one. In fact, every variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query, so we get
p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w,b)
which contains far fewer terms. In general, the complexity is linear in the number of CPT entries.
Variable elimination
• We can go even further.
• If we exchange the sums we get (with all the steps):
p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w,b)
       = α p(b) Σ_d Σ_w p(d|b) p(w) p(o|w,b)
       = α p(b) Σ_w Σ_d p(d|b) p(w) p(o|w,b)
       = α p(b) Σ_w p(w) p(o|w,b) Σ_d p(d|b)
       = α p(b) Σ_w p(w) p(o|w,b) · 1
• We could have also achieved this by a more direct path.
Variable elimination
• When we’ve pulled out all the redundant terms we get:
p(b|o) = α p(b) Σ_w p(w) p(o|w,b)
which contains far fewer terms than the original expression.
• In general, the complexity is linear in the number of CPT entries.
• This method is called variable elimination.
– if the number of parents is bounded, it is also linear in the number of nodes.
– the expressions are evaluated in right-to-left order (bottom-up in the network).
– intermediate results are stored.
– sums over each variable are done only for those expressions that depend on the variable.
• Note: for multiply connected networks, variable elimination can have exponential complexity in the worst case.
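A minimal sketch of the eliminated expression p(b|o) = α p(b) Σ_w p(w) p(o|w,b), again using the CPT numbers from the tables (the dictionary encoding is assumed for illustration). It gives the same posterior as brute-force enumeration.

```python
# Sketch: variable elimination result p(b|o) = alpha * p(b) * sum_w p(w) p(o|w,b).
# The irrelevant variables c and d have already been summed out (each sum is 1).

P_W, P_B = 0.05, 0.001
P_O = {(False, False): 0.01, (False, True): 0.25,
       (True, False): 0.05,  (True, True): 0.75}   # P(o | w, b)

def unnormalized_posterior(b):
    pw_terms = sum((P_W if w else 1 - P_W) * P_O[(w, b)] for w in (True, False))
    return (P_B if b else 1 - P_B) * pw_terms

alpha = 1.0 / (unnormalized_posterior(True) + unnormalized_posterior(False))
print(alpha * unnormalized_posterior(True))   # same answer as enumeration, roughly 0.022
```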
General inference questions in Bayesian networks
• For queries in Bayesian networks, we divide variables into three classes:
– evidence variables e = {e1, …, em}: what you know
– query variables x = {x1, …, xn}: what you want to know
– non-evidence (hidden) variables y = {y1, …, yl}: what you don’t care about
• The complete set of variables in the network is {e ∪ x ∪ y}.
• Inference in Bayesian networks consists of computing p(x|e), the posterior probability of the query given the evidence:
p(x|e) = p(x,e) / p(e) = α p(x,e) = α Σ_y p(x, e, y)
• This computes the marginal distribution p(x,e) by summing the joint over all values of y.
• Recall that the joint distribution is defined by the product of the conditional pdfs:
p(z) = ∏_i P(zi | parents(zi))
where the product is taken over all variables in the network.
Another approach: Simplify model using clustering algorithms
• Inference is efficient if you have a polytree, i.e., a singly connected network.
• But what if you don’t?
• Idea: Convert a non-singly connected network to an equivalent singly connected
network.
[Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → Wet Grass. The converted network has nodes marked “??”.]
What should go into the nodes?
Clustering or join tree algorithms
Original (multiply connected) network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → Wet Grass

P(C) = 0.5

 C  P(S|C)
 T  0.1
 F  0.5

 C  P(R|C)
 T  0.8
 F  0.2

 S  R  P(W|S,R)
 T  T  0.99
 T  F  0.9
 F  T  0.9
 F  F  0.01

Clustered (singly connected) network: Cloudy → Spr+Rain → Wet Grass

P(C) = 0.5

 P(S+R = x | C)
 C   TT    TF    FT    FF
 T   0.08  0.02  0.72  0.18
 F   0.1   0.4   0.1   0.4

 S+R  P(W|S+R)
 T+T  0.99
 T+F  0.9
 F+T  0.9
 F+F  0.01

• Idea: merge multiply connected nodes into a single, higher-dimensional node.
• Constructing the CPTs can take exponential time.
• But approximate algorithms usually give reasonable solutions.
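Since Sprinkler and Rain are conditionally independent given Cloudy, the merged CPT P(S+R|C) is just the product of the individual entries. The short sketch below reproduces the clustered table above.

```python
# Sketch: build P(S+R = (s, r) | C) from P(S|C) and P(R|C).
# S and R are conditionally independent given C, so the entries multiply.

P_S = {True: 0.1, False: 0.5}   # P(Sprinkler = T | Cloudy = c)
P_R = {True: 0.8, False: 0.2}   # P(Rain = T | Cloudy = c)

for c in (True, False):
    row = {}
    for s in (True, False):
        for r in (True, False):
            p_s = P_S[c] if s else 1 - P_S[c]
            p_r = P_R[c] if r else 1 - P_R[c]
            row[(s, r)] = round(p_s * p_r, 2)
    print("C =", c, row)
# Matches the P(S+R=x|C) table: 0.08, 0.02, 0.72, 0.18 for C=T
# and 0.1, 0.4, 0.1, 0.4 for C=F.
```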
So what do we do?
• There are special cases of Bayes nets for which there are fast, exact algorithms:
– variable elimination
– belief propagation
• There are also many approximations:
– stochastic (MCMC) approximations
– variational approximations
The complexity of multi-cause models
• In the models above, we specified the joint conditional density by hand.
• This specified the probability of a variable given each possible value of the causal nodes.
• For this we need to specify P(X | parents(X)).
• The number of parameters (table entries) scales exponentially with the number of causes.
• Can this be specified in a more generic way?
• Can we avoid having to specify every entry in the joint conditional pdf?
Beyond tables: modeling causal relationships using Noisy-OR
• We assume each cause Cj can produce
effect Ei with probability fij.
• The noisy-OR model assumes the
parent causes of effect Ei contribute
independently.
• The probability that none of them
caused effect Ei is simply the product of
the probabilities that each one did not
cause Ei.
• The probability that any of them caused
Ei is just one minus the above, i.e.
P(Ei | par(Ei)) = P(Ei | C1, …, Cn) = 1 − ∏_j (1 − P(Ei | Cj)) = 1 − ∏_j (1 − fij)
[Example: catch cold (C), with parent causes hit by viral droplet (D), touch contaminated object (O), and eat contaminated food (F), with link probabilities f_CD, f_CO, f_CF]

P(C | D, O, F) = 1 − (1 − f_CD)(1 − f_CO)(1 − f_CF)
Another noisy-OR example
Table 2. Conditional probability table for P(Fever | Cold, Flu, Malaria), as calculated from the noisy-OR model.

 Cold  Flu  Malaria  P(Fever)  P(¬Fever)
 F     F    F        0.0       1.0
 F     F    T        0.9       0.1
 F     T    F        0.8       0.2
 F     T    T        0.98      0.02  = 0.2 × 0.1
 T     F    F        0.4       0.6
 T     F    T        0.94      0.06  = 0.6 × 0.1
 T     T    F        0.88      0.12  = 0.6 × 0.2
 T     T    T        0.988     0.012 = 0.6 × 0.2 × 0.1
6.2. The noisy-OR model
Noisy-OR relationships generalize the logical notion of disjunctive causes. In propositional
logic, we might say Fever is true if and only if at least one of Cold, Flu, or Malaria is true.
The noisy-OR model adds some uncertainty to this strict logical description. The model
makes three assumptions. First, it assumes that each of the listed causes (Cold, Flu, and
Malaria) causes the effect (Fever), unless it is inhibited from doing so by some other cause.
Second, it assumes that whatever inhibits, say, Cold from causing a fever is independent
of whatever inhibits Flu from causing a fever. These inhibitors cause the uncertainty in
the model, giving rise to “noise parameters.” If P(Fever | Cold, ¬Flu, ¬Malaria) = 0.4,
P(Fever | ¬Cold, Flu, ¬Malaria) = 0.8, and P(Fever | ¬Cold, ¬Flu, Malaria) = 0.9, then
the noise parameters are 0.6, 0.2, and 0.1, respectively. Finally, it assumes that all the
possible causes are listed; that is, if none of the causes hold, the effect does not happen.
(This is not as strict as it seems, because we can always add a so-called leak node that covers
“miscellaneous causes.”)
These assumptions lead to a model where the causes—Cold, Flu, and Malaria—are the
parent nodes of the effect—Fever. The conditional probabilities are such that if no parent
node is true, then the effect node is false with 100% certainty. If exactly one parent is true,
then the effect is false with probability equal to the noise parameter for that node. In general,
the probability that the effect node is false is just the product of the noise parameters for
all the input nodes that are true. For the Fever variable in our example, we have the CPT
shown in Table 2.
The conditional probabilities for a noisy-OR node can therefore be described using a
number of parameters linear in the number of parents rather than exponential in the number
of parents. Let q_ip be the noise parameter for the p-th parent node of X_i; that is, q_ip is the
probability that X_i is false given that only its p-th parent is true. Let T_ik denote the set
of parent nodes of X_i that are set to T in the k-th assignment of values for the parent nodes
U_i. Then

w_ijk = ∏_{p ∈ T_ik} q_ip          if x_ij = F
      = 1 − ∏_{p ∈ T_ik} q_ip      if x_ij = T        (6)
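A short sketch of this rule for the Fever example: with noise parameters 0.6 (Cold), 0.2 (Flu), and 0.1 (Malaria), it reproduces the entries of Table 2.

```python
from itertools import product

# Noise parameters: probability the effect is inhibited when only that parent
# is true (values from the text above).
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

for cold, flu, malaria in product((False, True), repeat=3):
    assignment = {"Cold": cold, "Flu": flu, "Malaria": malaria}
    # P(not Fever) = product of the noise parameters of the parents that are true
    p_not_fever = 1.0
    for parent, is_true in assignment.items():
        if is_true:
            p_not_fever *= q[parent]
    print(assignment, "P(Fever) =", round(1 - p_not_fever, 3))
```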
A more complex model with noisy-OR nodes
[Figure 6 network nodes: battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, lights, oil light, gas gauge, no oil, no gas, fuel line blocked, starter broken, engine won’t start]

Figure 6. A network with noisy-OR nodes. Each node with multiple parents has a CPT described by the noisy-OR model (Equation 6). The network has 22 parameters, compared to 59 for a network that uses explicitly tabulated conditional probabilities.
[Figure 7 plot: average negative log likelihood per case vs. number of training cases (0–500), with curves for the explicit CPT network, the noisy-OR network, and the target network.]

Figure 7. The output prediction error Ĥ(D0, Pw) as a function of the number of cases observed, for data generated from the network shown in Figure 6. The plot shows learning curves for the APN algorithm using a noisy-OR representation and using explicit CPTs.
A general one-layer causal network
• Could either model causes and effects, or equivalently, stochastic binary features.
• Each input xi encodes the probability that the i-th binary input feature is present.
• The set of features represented by ϕj is defined by weights fij, which encode the probability that feature i is an instance of ϕj.
The data: a set of stochastic binary patterns
[Figure panels a, b, c: each column is a distinct eight-dimensional binary feature; the final panel shows the true hidden causes of the data.]
• There are five underlying causal feature patterns. What are they?
• This is a learning problem, which we’ll cover in a later lecture.
Hierarchical Statistical Models
A Bayesian belief network:
[Diagram: a node S_i with its parents pa(S_i), and the data D.]

The joint probability of binary states is
P(S | W) = ∏_i P(S_i | pa(S_i), W)

The probability of S_i depends only on its parents:
P(S_i | pa(S_i), W) = h(Σ_j S_j w_ji)      if S_i = 1
                    = 1 − h(Σ_j S_j w_ji)  if S_i = 0

The function h specifies how causes are combined, h(u) = 1 − exp(−u), u > 0.
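A minimal sketch of this conditional probability, assuming nonnegative weights w_ji (the states and weights below are hypothetical values, chosen so the argument of h stays nonnegative):

```python
import math

def h(u):
    """Combination function h(u) = 1 - exp(-u), for u >= 0."""
    return 1.0 - math.exp(-u)

def p_unit(s_i, parent_states, weights):
    """P(S_i | pa(S_i), W): h(sum_j S_j w_ji) if S_i = 1, else 1 - h(...)."""
    u = sum(s_j * w_ji for s_j, w_ji in zip(parent_states, weights))
    return h(u) if s_i == 1 else 1.0 - h(u)

# Hypothetical binary parent states and nonnegative weights, for illustration only
print(p_unit(1, [1, 0, 1], [0.5, 1.2, 0.3]))   # P(S_i = 1 | parents)
```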
Main points:
• hierarchical structure allows model to form
high order representations
• upper states are priors for lower states
• weights encode higher order features