L15 – Inference in Bayes Nets
EECS 391
Intro to AI
Tue Oct 30
Recap: Variable elimination on the burglary network
• We could do straight summation:
• But: the number of terms in the sum is exponential in the non-evidence variables.
• This is bad, and we can do much better.
• We start by observing that we can pull out many terms from the summation.
p(b|o) = α Σ_{w,c,d} p(o, w, b, c, d)
       = α Σ_{w,c,d} p(o|w, b) p(c|w) p(d|b) p(w) p(b)
Variable elimination
• When we’ve pulled out all the redundant terms we get:
p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w, b) Σ_c p(c|w)
• We can also note the last term sums to one. In fact, every variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query, so we get
p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w, b)
which contains far fewer terms. In general, complexity is linear in the # of CPT entries.
Variable elimination
• We can go even further.
• If we exchange the sums we get (with all the steps):
• We could have also achieved this by a more direct path.
p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w, b)
       = α p(b) Σ_d Σ_w p(d|b) p(w) p(o|w, b)
       = α p(b) Σ_w Σ_d p(d|b) p(w) p(o|w, b)
       = α p(b) Σ_w p(w) p(o|w, b) Σ_d p(d|b)
       = α p(b) Σ_w p(w) p(o|w, b) · 1
Variable elimination
• When we’ve pulled out all the redundant terms we get:
p(b|o) = α p(b) Σ_w p(w) p(o|w, b)
which contains far fewer terms than the original expression.
• In general, complexity is linear in the # of CPT entries.
– if the # of parents is bounded, it is also linear in the number of nodes.
• This method is called variable elimination:
– the expressions are evaluated in right-to-left order (bottom-up in the network)
– intermediate results are stored
– sums over each variable are done only for those expressions that depend on it
• Note: for multiply connected networks, variable elimination can have exponential complexity in the worst case.
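To make the reduced expression concrete, here is a minimal sketch of the computation p(b|o) = α p(b) Σ_w p(w) p(o|w, b) in Python. The table for p(o|w, b) is the open-door CPT that appears later in these notes; the priors p(w) and p(b) are made-up values used only for illustration.

# A minimal sketch of p(b|o=T) = alpha * p(b) * sum_w p(w) * p(o=T | w, b).
# p_o_given_wb is the open-door CPT shown later in these notes;
# p_w and p_b are assumed priors, chosen only for illustration.

p_w = {True: 0.3, False: 0.7}           # assumed prior that the wife is home
p_b = {True: 0.1, False: 0.9}           # assumed prior on a burglary
p_o_given_wb = {                        # P(O = T | W, B)
    (False, False): 0.01, (False, True): 0.25,
    (True, False): 0.05, (True, True): 0.75,
}

def posterior_burglary_given_open():
    # unnormalized p(b, o=T) = p(b) * sum_w p(w) * p(o=T | w, b)
    unnorm = {b: p_b[b] * sum(p_w[w] * p_o_given_wb[(w, b)] for w in (True, False))
              for b in (True, False)}
    alpha = 1.0 / sum(unnorm.values())
    # the sums over c and d never appear: each of them sums to 1
    return {b: alpha * v for b, v in unnorm.items()}

print(posterior_burglary_given_open())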
General inference questions in Bayesian networks
• For queries in Bayesian networks, we divide variables into three classes:
– evidence variables: e = {e1, . . . , em} (what you know)
– query variables: x = {x1, . . . , xn} (what you want to know)
– non-evidence variables: y = {y1, . . . , yl} (what you don’t care about)
• The complete set of variables in the network is e ∪ x ∪ y.
• Inferences in Bayesian networks consist of computing p(x|e), the posterior probability of
the query given the evidence:
p(x|e) = p(x, e) / p(e) = α p(x, e) = α Σ_y p(x, e, y)
• This computes the marginal distribution p(x, e) by summing the joint over all values of y.
• Recall that the joint distribution is defined by the product of the conditional pdfs, where the product is taken over all variables in the network:
p(z) = ∏_i p(z_i | parents(z_i))
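As a concrete sketch of p(x|e) = α Σ_y p(x, e, y), the following computes a posterior in the sprinkler network used on the next slides by brute-force enumeration of the joint. The CPT values are the ones given below; the dictionary encoding and variable ordering are implementation choices, not part of the slides.

from itertools import product

# CPTs of the sprinkler network (the values given on the following slides)
P_C = 0.5                                           # P(C = T)
P_S = {True: 0.1, False: 0.5}                       # P(S = T | C)
P_R = {True: 0.8, False: 0.2}                       # P(R = T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.01}    # P(W = T | S, R)

def bern(p_true, value):
    # probability of a binary variable taking 'value' when P(True) = p_true
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    # p(c, s, r, w): product of the conditional distributions
    return bern(P_C, c) * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_W[(s, r)], w)

def posterior(query_var, evidence):
    # p(query | evidence) = alpha * sum of the joint over all non-evidence variables
    names = ("C", "S", "R", "W")
    dist = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=4):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            dist[assignment[query_var]] += joint(*values)
    alpha = 1.0 / sum(dist.values())
    return {k: alpha * v for k, v in dist.items()}

print(posterior("R", {"S": True}))   # P(Rain | Sprinkler = true)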
Another approach: Simplify model using clustering algorithms
• Inference is efficient if you have a polytree, i.e., a singly connected network.
• But what if you don’t?
• Idea: Convert a non-singly connected network to an equivalent singly connected
network.
[Figure: the multiply connected sprinkler network (Cloudy, Sprinkler, Rain, Wet Grass) redrawn as a chain with placeholder “??” nodes.]
What should go into the nodes?
Clustering or join tree algorithms
Original network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → Wet Grass
P(C) = 0.5
P(S=T|C):  C=T: 0.1   C=F: 0.5
P(R=T|C):  C=T: 0.8   C=F: 0.2
P(W=T|S,R):  S=T,R=T: 0.99   S=T,R=F: 0.9   S=F,R=T: 0.9   S=F,R=F: 0.01

Clustered network: Cloudy → Spr+Rain → Wet Grass
P(C) = 0.5
P(S+R=x|C):
  C     x=TT   x=TF   x=FT   x=FF
  T     0.08   0.02   0.72   0.18
  F     0.1    0.4    0.1    0.4
P(W=T|S+R):  T+T: 0.99   T+F: 0.9   F+T: 0.9   F+F: 0.01
Idea: merge multiply connected nodes into a single, higher-dimensional node.
Constructing the merged CPTs can take exponential time, but approximate algorithms usually give reasonable solutions.
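The clustered CPT above can be computed mechanically from the original ones, since Sprinkler and Rain are conditionally independent given Cloudy. A minimal sketch (the dictionary encoding is an arbitrary choice):

# P(S = T | C) and P(R = T | C) from the original network
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}

def merged_cpt(c):
    # P(S+R = (s, r) | C = c) is the product of the two original CPTs,
    # because S and R are conditionally independent given C
    table = {}
    for s in (True, False):
        for r in (True, False):
            ps = P_S[c] if s else 1.0 - P_S[c]
            pr = P_R[c] if r else 1.0 - P_R[c]
            table[(s, r)] = ps * pr
    return table

print(merged_cpt(True))    # (T,T): 0.08, (T,F): 0.02, (F,T): 0.72, (F,F): 0.18
print(merged_cpt(False))   # (T,T): 0.10, (T,F): 0.40, (F,T): 0.10, (F,F): 0.40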
So what do we do?
• There are special cases of Bayes nets for which there are fast, exact algorithms:
– variable elimination
– belief propagation
• There are also many approximate methods:
– stochastic (MCMC) approximations
– variational approximations
The complexity of multi-cause models
• In the models above, we specified the
joint conditional density by hand.
• This specified the probability of a
variable given each possible value of the
causal nodes.
• Can this be specified in a more generic
way?
• Can we avoid having to specify every
entry in the joint conditional pdf?
• For this we need to specify:
P(X | parents(X))
• The number of parameters (table
entries) scales exponentially with the
number of causes
[Figure: wife (W) and burglar (B) are both possible causes of the open door (O).]
W  B   P(O|W,B)
F  F   0.01
F  T   0.25
T  F   0.05
T  T   0.75
Beyond tables: modeling causal relationships using Noisy-OR
• We assume each cause Cj can produce
effect Ei with probability fij.
• The noisy-OR model assumes the
parent causes of effect Ei contribute
independently.
• The probability that none of them
caused effect Ei is simply the product of
the probabilities that each one did not
cause Ei.
• The probability that any of them caused
Ei is just one minus the above, i.e.
[Figure: three causes of catching a cold (C): hit by viral droplet (D), touching a contaminated object (O), and eating contaminated food (F), with parameters fCD, fCO, fCF.]
P(C|D, O, F) = 1 − (1 − fCD)(1 − fCO)(1 − fCF)
In general:
P(Ei|par(Ei)) = P(Ei|C1, . . . , Cn)
             = 1 − ∏_j (1 − P(Ei|Cj))
             = 1 − ∏_j (1 − fij)
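A tiny sketch of the noisy-OR combination rule above; the values chosen for fCD, fCO, and fCF are hypothetical, used only to exercise the formula.

def noisy_or(active_causes, f):
    # P(effect | causes) = 1 minus the probability that no active cause produced it
    p_none = 1.0
    for cause in active_causes:
        p_none *= 1.0 - f[cause]
    return 1.0 - p_none

# hypothetical parameter values, only to exercise the formula
f = {"D": 0.3, "O": 0.2, "F": 0.1}
print(noisy_or({"D", "O", "F"}, f))    # 1 - 0.7*0.8*0.9 = 0.496
print(noisy_or({"O"}, f))              # a single active cause gives back f_CO = 0.2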
Another noisy-OR example
Table 2. Conditional probability table for P(Fever | Cold, Flu, Malaria), as calculated from the noisy-OR model.
Cold  Flu  Malaria   P(Fever)   P(¬Fever)
F     F    F         0.0        1.0
F     F    T         0.9        0.1
F     T    F         0.8        0.2
F     T    T         0.98       0.02 = 0.2 × 0.1
T     F    F         0.4        0.6
T     F    T         0.94       0.06 = 0.6 × 0.1
T     T    F         0.88       0.12 = 0.6 × 0.2
T     T    T         0.988      0.012 = 0.6 × 0.2 × 0.1
6.2. The noisy-OR model
Noisy-OR relationships generalize the logical notion of disjunctive causes. In propositional
logic, we might say Fever is true if and only if at least one of Cold, Flu, or Malaria is true.
The noisy-OR model adds some uncertainty to this strict logical description. The model
makes three assumptions. First, it assumes that each of the listed causes (Cold, Flu, and
Malaria) causes the effect (Fever), unless it is inhibited from doing so by some other cause.
Second, it assumes that whatever inhibits, say, Cold from causing a fever is independent
of whatever inhibits Flu from causing a fever. These inhibitors cause the uncertainty in
the model, giving rise to “noise parameters.” If P(Fever | Cold, ¬Flu, ¬Malaria) = 0.4,
P(Fever | ¬Cold, Flu, ¬Malaria) = 0.8, and P(Fever | ¬Cold, ¬Flu, Malaria) = 0.9, then
the noise parameters are 0.6, 0.2, and 0.1, respectively. Finally, it assumes that all the
possible causes are listed; that is, if none of the causes hold, the effect does not happen.
(This is not as strict as it seems, because we can always add a so-called leak node that covers
“miscellaneous causes.”)
These assumptions lead to a model where the causes—Cold, Flu, and Malaria—are the
parent nodes of the effect—Fever. The conditional probabilities are such that if no parent
node is true, then the effect node is false with 100% certainty. If exactly one parent is true,
then the effect is false with probability equal to the noise parameter for that node. In general,
the probability that the effect node is false is just the product of the noise parameters for
all the input nodes that are true. For the Fever variable in our example, we have the CPT
shown in Table 2.
The conditional probabilities for a noisy-OR node can therefore be described using a
number of parameters linear in the number of parents rather than exponential in the number
of parents. Let qip be the noise parameter for the pth parent node of Xi—that is, qip is the
probability that Xi is false given that only its pth parent is true, and let Tik denote the set
of parent nodes of Xi that are set to T in the kth assignment of values for the parent nodes
Ui. Then
wijk = ∏_{p ∈ Tik} qip          if xij = F
     = 1 − ∏_{p ∈ Tik} qip      if xij = T.      (6)
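The CPT in Table 2 can be reproduced directly from the three noise parameters (0.6, 0.2, 0.1), which makes the linear parameterization concrete. A sketch:

from itertools import product

# noise parameters from the text: 0.6 for Cold, 0.2 for Flu, 0.1 for Malaria
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def p_fever(cold, flu, malaria):
    # the effect is false with probability equal to the product of the
    # noise parameters of the parents that are true
    p_false = 1.0
    for name, value in (("Cold", cold), ("Flu", flu), ("Malaria", malaria)):
        if value:
            p_false *= q[name]
    return 1.0 - p_false

for cold, flu, malaria in product((False, True), repeat=3):
    print(cold, flu, malaria, round(p_fever(cold, flu, malaria), 3))
# e.g. (T, T, T) gives 1 - 0.6*0.2*0.1 = 0.988, matching Table 2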
A more complex model with noisy-OR nodes
[Figure: a car-diagnosis network with nodes battery age, alternator broken, fanbelt broken, no charging, battery dead, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, and engine won’t start.]
Figure 6. A network with noisy-OR nodes. Each node with multiple parents has a CPT described by the noisy-OR model (Equation 6). The network has 22 parameters, compared to 59 for a network that uses explicitly tabulated conditional probabilities.
[Figure: learning curves; x-axis: number of training cases (0–500); y-axis: average negative log likelihood per case (1.4–2.4); curves for the explicit-CPT network, the noisy-OR network, and the target network.]
Figure 7. The output prediction error Ĥ(D0, Pw) as a function of the number of cases observed, for data generated from the network shown in Figure 6. The plot shows learning curves for the APN algorithm using a noisy-OR representation and using explicit CPTs.
• The model can describe either causes and effects or, equivalently, stochastic binary features.
• Each input xi encodes the
probability that the ith binary
input feature is present.
• The set of features
represented by ϕj is defined
by weights fij which encode the
probability that feature i is an
instance of ϕj.
A general one-layer causal network
The data: a set of stochastic binary patterns
[Figure: panels (a), (b), (c); each column is a distinct eight-dimensional binary feature.]
There are five underlying causal feature patterns. What are they?
[Figure: the true hidden causes of the data.]
This is a learning problem, which we’ll cover in a later lecture.
Hierarchical Statistical Models
A Bayesian belief network:
[Figure: a belief network in which each unit Si depends on its parents pa(Si), with observed data D.]
The joint probability of binary states is
P(S|W) = ∏_i P(Si | pa(Si), W)
The probability of Si depends only on its parents:
P(Si | pa(Si), W) = h(Σ_j Sj wji)       if Si = 1
                  = 1 − h(Σ_j Sj wji)   if Si = 0
The function h specifies how causes are combined: h(u) = 1 − exp(−u), u > 0.
Main points:
• hierarchical structure allows model to form
high order representations
• upper states are priors for lower states
• weights encode higher order features
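A minimal sketch of the conditional P(Si | pa(Si), W) with the combination function h(u) = 1 − exp(−u); the parent states and weights below are made up purely for illustration.

import numpy as np

def h(u):
    # combination function from the slide: h(u) = 1 - exp(-u), u > 0
    return 1.0 - np.exp(-u)

def p_state(s_i, parent_states, weights):
    # P(S_i = s_i | pa(S_i), W): h(u) if s_i = 1, otherwise 1 - h(u),
    # where u = sum_j S_j w_ji is the causal input from the parents
    u = float(np.dot(parent_states, weights))
    return h(u) if s_i == 1 else 1.0 - h(u)

# made-up parent states and weights, purely for illustration
parents = np.array([1, 0, 1])
w = np.array([0.5, 1.2, 0.3])
print(p_state(1, parents, w))    # 1 - exp(-0.8)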
Another approach: Inference by stochastic simulation
Basic idea:
1. Draw N samples from a sampling distribution S
2. Compute an approximate posterior probability
3. Show this converges to the true probability
Sampling from an empty network
function Prior-Sample(bn) returns an event sampled from bn
inputs: bn, a belief network specifying joint distribution P(X1, . . . , Xn)
x← an event with n elements
for i = 1 to n do
xi ← a random sample from P(Xi | parents(Xi))
given the values of Parents(Xi) in x
return x
Sampling with no evidence (from the prior)
Example
[Figure: the sprinkler network (Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → Wet Grass); each node is sampled in turn given its parents.]
P(C) = 0.50
P(S=T|C):  C=T: 0.10   C=F: 0.50
P(R=T|C):  C=T: 0.80   C=F: 0.20
P(W=T|S,R):  S=T,R=T: 0.99   S=T,R=F: 0.90   S=F,R=T: 0.90   S=F,R=F: 0.01
Keep sampling to estimate joint probabilities of interest.
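A runnable sketch of Prior-Sample specialized to this network; the topological ordering, the random module, and the 10,000-sample count are implementation choices.

import random

# CPTs of the sprinkler network, as on the slide
P_C = 0.5
P_S = {True: 0.10, False: 0.50}                     # P(S = T | C)
P_R = {True: 0.80, False: 0.20}                     # P(R = T | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(W = T | S, R)

def prior_sample():
    # sample each variable in topological order, conditioned on its parents
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}

# keep sampling to estimate a joint probability of interest
samples = [prior_sample() for _ in range(10000)]
frac = sum(x["Rain"] and x["WetGrass"] for x in samples) / len(samples)
print("P(Rain, WetGrass) ~", frac)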
Rejection sampling
P̂(X|e) estimated from samples agreeing with e
function Rejection-Sampling(X,e, bn,N) returns an estimate of P (X |e)
local variables: N, a vector of counts over X, initially zero
for j = 1 to N do
x←Prior-Sample(bn)
if x is consistent with e then
N[x]←N[x]+1 where x is the value of X in x
return Normalize(N[X])
E.g., estimate P(Rain|Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.
P̂(Rain|Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure
What if we do have some evidence? Rejection sampling.
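A sketch of rejection sampling that reuses the prior_sample function from the earlier sketch; the sample count and variable names are arbitrary choices.

from collections import Counter

def rejection_sampling(query_var, evidence, n_samples=10000):
    # estimate P(query | evidence) from the prior samples that agree with the evidence
    counts = Counter()
    for _ in range(n_samples):
        x = prior_sample()                     # from the sketch above
        if all(x[var] == val for var, val in evidence.items()):
            counts[x[query_var]] += 1
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# e.g. estimate P(Rain | Sprinkler = true)
print(rejection_sampling("Rain", {"Sprinkler": True}))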
Analysis of rejection sampling
P̂(X|e) = αNPS(X, e) (algorithm defn.)
= NPS(X, e)/NPS(e) (normalized by NPS(e))
≈ P(X, e)/P (e) (property of PriorSample)
= P(X|e) (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates
Problem: hopelessly expensive if P (e) is small
P (e) drops off exponentially with number of evidence variables!
Approximate inference using MCMC
“State” of network = current assignment to all variables.
Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed
function MCMC-Ask(X,e, bn,N) returns an estimate of P (X |e)
local variables: N[X ], a vector of counts over X, initially zero
Z, the nonevidence variables in bn
x, the current state of the network, initially copied from e
initialize x with random values for the variables in Z
for j = 1 to N do
for each Zi in Z do
sample the value of Zi in x from P(Zi |mb(Zi))
given the values of MB(Zi) in x
N[x ]←N[x ] + 1 where x is the value of X in x
return Normalize(N[X ])
Can also choose a variable to sample at random each time
Approximate inference using Markov Chain Monte Carlo (MCMC)
The extent of dependencies in Bayesian networks
• A node X is conditionally independent
of its non-descendants given its
parents
• A node X is conditionally independent
of all the other nodes in the network
given its Markov blanket.
[Figure: a node X shown with its parents and non-descendants, and with its Markov blanket shaded.]
The Markov chain
With Sprinkler = true, WetGrass = true, there are four states:
[Figure: the four states of the chain, one for each joint assignment to Cloudy and Rain, with Sprinkler and Wet Grass fixed to true.]
Wander about for a while, average what you see
MCMC example contd.
Estimate P(Rain|Sprinkler = true,WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.
E.g., visit 100 states
31 have Rain = true, 69 have Rain = false
P̂(Rain|Sprinkler = true,WetGrass = true)
= Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Theorem: chain approaches stationary distribution:
long-run fraction of time spent in each state is exactly
proportional to its posterior probability
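A sketch of the MCMC procedure specialized to this query, with Sprinkler and WetGrass fixed to true; the Markov-blanket conditionals for Cloudy and Rain are worked out by hand from the CPTs, and the step count is arbitrary. With enough steps the estimate settles near the posterior quoted above.

import random

# CPTs of the sprinkler network (as on the slides)
P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def sample_cloudy(rain, sprinkler):
    # P(C | mb(C)) = P(C | S, R), proportional to P(C) P(S|C) P(R|C)
    w = {c: bern(P_C, c) * bern(P_S[c], sprinkler) * bern(P_R[c], rain)
         for c in (True, False)}
    return random.random() < w[True] / (w[True] + w[False])

def sample_rain(cloudy, sprinkler, wet):
    # P(R | mb(R)) = P(R | C, S, W), proportional to P(R|C) P(W|S,R)
    w = {r: bern(P_R[cloudy], r) * bern(P_W[(sprinkler, r)], wet)
         for r in (True, False)}
    return random.random() < w[True] / (w[True] + w[False])

def mcmc_rain(n_steps=100000):
    sprinkler, wet = True, True                       # evidence, kept fixed
    cloudy, rain = random.random() < 0.5, random.random() < 0.5
    rain_count = 0
    for _ in range(n_steps):
        cloudy = sample_cloudy(rain, sprinkler)
        rain = sample_rain(cloudy, sprinkler, wet)
        rain_count += rain
    return rain_count / n_steps

print("P(Rain | Sprinkler = T, WetGrass = T) ~", mcmc_rain())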
After obtaining the MCMC samples
But:
1. Difficult to tell when samples have converged. Theorem only applies in
limit, and it could take time to “settle in”.
2. Can also be inefficient if each state depends on many other variables.
• Model represents stochastic
binary features.
• Each input xi encodes the
probability that the ith binary
input feature is present.
• The set of features represented
by ϕj is defined by weights fij
which encode the probability
that feature i is an instance of ϕj.
• Trick: it’s easier to adapt weights in an unbounded space, so the probabilities fij are transformed to unbounded weights and the optimization is done in w-space.
Gibbs sampling (back to the noisy-OR example)
Inferring the best representation of the observed variables
• Given the input D, there is no simple way to determine which states are the input’s most likely causes.
– Computing the most probable network state is an inference process
– we want to find the explanation of the data with highest probability
– this can be done efficiently with Gibbs sampling
• Gibbs sampling is another example of an MCMC method
• Key idea:
The samples are guaranteed to converge to the
true posterior probability distribution
Gibbs Sampling
Gibbs sampling is a way to select an ensemble of states that are representative of
the posterior distribution P (S|D,W).
• Each state of the network is updated iteratively according to the probability of
Si given the remaining states.
• this conditional probability can be computed using (Neal, 1992)
P(Si = a | Sj : j ≠ i, W) ∝ P(Si = a | pa(Si), W) ∏_{j ∈ ch(Si)} P(Sj | pa(Sj), Si = a, W)
• limiting ensemble of states will be typical samples from P (S|D,W)
• also works if any subset of states are fixed and the rest are sampled
Network Interpretation of the Gibbs Sampling Equation
The probability of Si changing state given the remaining states is
P(Si = 1 − Si | Sj : j ≠ i, W) = 1 / (1 + exp(−Δxi))
Δxi indicates how much changing the state Si changes the probability of the whole network state:
Δxi = log h(ui; 1 − Si) − log h(ui; Si) + Σ_{j ∈ ch(Si)} [log h(uj + Δij; Sj) − log h(uj; Sj)]
• ui is the causal input to Si: ui = Σ_k Sk wki
• Δij specifies the change in uj for a change in Si: Δij = +Sj wij if Si = 0, or −Sj wij if Si = 1
The Gibbs sampling equations (derivation omitted)
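A sketch of the flip probability 1/(1 + exp(−Δxi)) for this model; the tiny three-unit network and its weights are made up for illustration, with an always-on bias unit standing in for top-level priors (an assumption of this sketch, not part of the slides).

import numpy as np

def h(u):
    # combination rule from the slides: h(u) = 1 - exp(-u)
    return 1.0 - np.exp(-u)

def log_p(s, u):
    # log P(S = s | causal input u): h(u) if s = 1, 1 - h(u) if s = 0
    return np.log(h(u)) if s == 1 else np.log(1.0 - h(u))

def gibbs_flip_prob(i, S, W):
    # P(S_i flips | all other states) = 1 / (1 + exp(-dx_i)), where dx_i is the
    # change in log probability of the whole state when S_i is flipped;
    # only unit i's own factor and its children's factors are affected.
    # W[k, j] is the weight from unit k to unit j.
    S_new = S.copy()
    S_new[i] = 1 - S[i]
    u_i = S @ W[:, i]                                 # causal input to unit i
    dx = log_p(S_new[i], u_i) - log_p(S[i], u_i)      # own factor
    for j in np.nonzero(W[i, :])[0]:                  # children of unit i
        dx += log_p(S[j], S_new @ W[:, j]) - log_p(S[j], S @ W[:, j])
    return 1.0 / (1.0 + np.exp(-dx))

# Tiny made-up network: unit 0 is an always-on bias, unit 1 a hidden cause,
# unit 2 an observed effect. All weights are hypothetical.
W = np.zeros((3, 3))
W[0, 1], W[0, 2], W[1, 2] = 0.5, 0.1, 2.0
S = np.array([1, 0, 1])                               # current state; unit 2 is on
print(gibbs_flip_prob(1, S, W))                       # probability of turning unit 1 on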
Interpretation of the Gibbs sampling equation
• The Gibbs equation can be interpreted as: feedback + ∑ feedforward
• feed-back: how consistent is Si with current causes?
• ∑ feedforward: how likely Si is to be a cause of its children
• feedback allows the lower-level units to use information only
computable at higher levels
• feedback determines (disambiguates) the state when the
feedforward input is ambiguous
The Shifter Problem
[Figure: (A) shift patterns; (B) weights of a 32-20-2 network after learning.]
Gibbs sampling: feedback disambiguates lower-level states
[Figure: four panels (1–4) of Gibbs sampling sweeps.]
Once the structure is learned, the Gibbs updating converges in two sweeps.
The higher-order lines problem
[Figure: (A) the true generative model, with cause probabilities 0.6 and 0.1, 0.1, 0.1, 0.1; (B) patterns sampled from the model.]
Can we infer the structure of the network given only the patterns?
Weights in a 25-10-5 belief network after learning:
• The first layer of weights learns that patterns are combinations of lines.
• The second layer learns combinations of the first-layer features.