
L14 – Inference in Bayes Nets

EECS 391
Intro to AI

Inference in Bayes Nets

L14 Thu Oct 25

Recap: Modeling causal relationships with belief networks

Direct cause: A → B. Specify P(B|A).

Indirect cause: A → B → C. Specify P(B|A) and P(C|B).

Common cause: B ← A → C. Specify P(B|A) and P(C|A).

Common effect: A → C ← B. Specify P(C|A,B).

Defining the belief network

• Each link in the graph represents a conditional relationship between nodes.
• To perform inference, we must specify the conditional probabilities.
• Let’s start with the open door. What do we specify?

Network structure: wife (W) and burglar (B) are parents of open door (O); burglar (B) is the parent of damaged door (D); wife (W) is the parent of car in garage (C).

W  B  P(O|W,B)
F  F  0.01
F  T  0.25
T  F  0.05
T  T  0.75

P(W) = 0.05        P(B) = 0.001

Finally, we specify the remaining conditionals:

B  P(D|B)
F  0.001
T  0.5

W  P(C|W)
F  0.01
T  0.95

Now what?

Calculating probabilities using the joint distribution

• What is the probability that the door is open, it is my wife and not a burglar, we see the
car in the garage, and the door is not damaged?

• Mathematically, we want to compute the expression: P(o,w,¬b,c,¬d) = ?
• We can just repeatedly apply the rule relating joint and conditional probabilities.

– P(x,y) = P(x|y) P(y)

Summary of inference with the joint probability distribution

• The complete (probabilistic) relationship between variables is specified by the
joint probability: 



• All conditional and marginal distributions can be derived from this using the
basic rules of probability, the sum rule and the product rule 


P(x1, x2, ..., xn) = P(X1 = x1, X2 = x2, ..., Xn = xn)

P(X) = Σ_Y P(X, Y)                          sum rule, "marginalization"

P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y)         product rule

P(Y|X) = P(X, Y) / P(X)                     corollary, conditional probability

P(Y|X) = P(X|Y) P(Y) / P(X)                 corollary, Bayes rule


Calculating probabilities using the joint distribution

• The probability that the door is open, it is my wife and not a burglar, we see the car in
the garage, and the door is not damaged.

• P(o,w,¬b,c,¬d) = P(o|w,¬b,c,¬d)P(w,¬b,c,¬d)
= P(o|w,¬b)P(w,¬b,c,¬d)

= P(o|w,¬b)P(c|w,¬b,¬d)P(w,¬b,¬d)

= P(o|w,¬b)P(c|w)P(w,¬b,¬d)

= P(o|w,¬b)P(c|w)P(¬d|w,¬b)P(w,¬b)

= P(o|w,¬b)P(c|w)P(¬d|¬b)P(w,¬b)

= P(o|w,¬b)P(c|w)P(¬d|¬b)P(w)P(¬b)


Calculating probabilities using the joint distribution

• P(o,w,¬b,c,¬d) = P(o|w,¬b)P(c|w)P(¬d|¬b)P(w)P(¬b)
= 0.05 × 0.95 × 0.999 × 0.05 × 0.999 = 0.0024

• This is essentially the probability that my wife is home and leaves the door open.
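As a sanity check, here is a minimal Python sketch (not from the slides) that encodes the CPTs above as dictionaries and multiplies out the factored form; it reproduces the 0.0024 figure.

```python
# CPTs from the burglary network (True/False keyed dictionaries).
P_W = {True: 0.05, False: 0.95}                     # wife home
P_B = {True: 0.001, False: 0.999}                   # burglar present
P_O = {(True, True): 0.75, (True, False): 0.05,     # P(O=True | W, B)
       (False, True): 0.25, (False, False): 0.01}
P_D = {True: 0.5, False: 0.001}                     # P(D=True | B)
P_C = {True: 0.95, False: 0.01}                     # P(C=True | W)

# P(o, w, ¬b, c, ¬d) = P(o|w,¬b) P(c|w) P(¬d|¬b) P(w) P(¬b)
w, b = True, False
p = P_O[(w, b)] * P_C[w] * (1 - P_D[b]) * P_W[w] * P_B[b]
print(round(p, 4))   # 0.0024
```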


Calculating probabilities in a general Bayesian belief network

• Note that by specifying all the conditional probabilities, we have also specified the joint
probability. For a directed graph with edges A → C, A → E, C → B, C → E, and {C, E} → D:

P(A,B,C,D,E) = P(A) P(B|C) P(C|A) P(D|C,E) P(E|A,C)

• The general expression is:
P(x1, ..., xn) ≡ P(X1 = x1 ∧ ... ∧ Xn = xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

● With this we can calculate (in principle) any entry of the joint distribution.


● This implies that we can also calculate any conditional probability.

For the burglar model

• The structure of this model allows a simple expression for the joint probability



P(x1, ..., xn) ≡ P(X1 = x1 ∧ ... ∧ Xn = xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

⇒ P(o, c, d, w, b) = P(c|w) P(o|w, b) P(d|b) P(w) P(b)


What if we want a simpler probabilistic question?

• How do we calculate P(b|o), i.e. the probability of a burglar given we see the open door?
• This is not an entry in the joint distribution. We had: 


P(o,w,¬b,c,¬d) = P(o|w,¬b)P(c|w)P(¬d|¬b)P(w)P(¬b) 

= 0.05 × 0.95 × 0.999 × 0.05 × 0.999 = 0.0024


Calculating conditional probabilities

• So, how do we compute P(b|o)?
• Repeatedly apply laws of probability (factorization, marginalization, etc).
• Using the joint we can compute any conditional probability too
• The conditional probability of any one subset of variables given another disjoint subset is

  P(S1|S2) = P(S1 ∧ S2) / P(S2) = ( Σ_{p ∈ S1 ∧ S2} p ) / ( Σ_{p ∈ S2} p )

  where p ∈ S is shorthand for all the entries of the joint matching subset S.

• How many terms are in these sums? Up to 2^N: the number of terms is exponential in the number of variables.

• In fact, querying general Bayes nets is NP-complete.
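To see what this costs, here is a brute-force sketch (an illustration, not the slides' code) that computes P(b|o) by summing entries of the joint; each sum here has only 2^3 or 2^4 terms, but in general the number of terms grows as 2^N.

```python
from itertools import product

def joint(o, w, b, c, d):
    """P(o,w,b,c,d) for the burglary network, using the CPT values from the slides."""
    p_w = 0.05 if w else 0.95
    p_b = 0.001 if b else 0.999
    p_o = {(1, 1): 0.75, (1, 0): 0.05, (0, 1): 0.25, (0, 0): 0.01}[(w, b)]
    p_d = 0.5 if b else 0.001
    p_c = 0.95 if w else 0.01
    return (p_o if o else 1 - p_o) * (p_c if c else 1 - p_c) * \
           (p_d if d else 1 - p_d) * p_w * p_b

# P(b | o) = P(b, o) / P(o): sum the matching joint entries and normalize.
num = sum(joint(1, w, 1, c, d) for w, c, d in product((0, 1), repeat=3))
den = sum(joint(1, w, b, c, d) for w, b, c, d in product((0, 1), repeat=4))
print(num / den)   # ≈ 0.022: an open door by itself is weak evidence of a burglar
```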

Variable elimination on the burglary network

• We could do straight summation: 



• But: the number of terms in the sum is exponential in the number of non-evidence variables.
• This is bad, and we can do much better.
• We start by observing that we can pull out many terms from the summation.

p(b|o) = α Σ_{w,c,d} p(o, w, b, c, d)

       = α Σ_{w,c,d} p(o|w, b) p(c|w) p(d|b) p(w) p(b)

Variable elimination

• When we’ve pulled out all the redundant terms we get:

  p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w, b) Σ_c p(c|w)

• We can also note the last term, Σ_c p(c|w), sums to one. In fact, every variable that is not an
  ancestor of a query variable or evidence variable is irrelevant to the query, so we get

  p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w, b)

  which contains far fewer terms. In general, complexity is linear in the # of CPT entries.

Variable elimination

• We can go even further.
• If we exchange the sums we get (with all the steps): 













• We could have also achieved this by a more direct path.

p(b|o) = α p(b) Σ_d p(d|b) Σ_w p(w) p(o|w, b)

       = α p(b) Σ_d Σ_w p(d|b) p(w) p(o|w, b)

       = α p(b) Σ_w Σ_d p(d|b) p(w) p(o|w, b)

       = α p(b) Σ_w p(w) p(o|w, b) Σ_d p(d|b)

       = α p(b) Σ_w p(w) p(o|w, b) · 1

Variable elimination

• When we’ve pulled out all the redundant terms we get: 


• which contains far fewer terms than the original expression.
• In general, complexity is linear in the # of CPT entries.
• This method is called variable elimination.

– if # of parents is bounded, also linear in the number of nodes.
– the expressions are evaluated in right-to-left order (bottom-up in the network)
– intermediate results are stored
– sums over each are done only for those expressions that depend on the variable


• Note: for multiply connected networks, variable elimination can have exponential
complexity in the worst case.

p(b|o) = α p(b) Σ_w p(w) p(o|w, b)
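A minimal sketch of the eliminated expression (again just an illustration using the slides' CPT values): only the factors mentioning B, W, and O survive, since the sums over C and D each equal one.

```python
# p(b|o) = alpha * p(b) * sum_w p(w) p(o | w, b)
P_W = {True: 0.05, False: 0.95}
P_B = {True: 0.001, False: 0.999}
P_O = {(True, True): 0.75, (True, False): 0.05,     # P(O=True | W, B)
       (False, True): 0.25, (False, False): 0.01}

unnorm = {b: P_B[b] * sum(P_W[w] * P_O[(w, b)] for w in (True, False))
          for b in (True, False)}
alpha = 1.0 / sum(unnorm.values())
print(alpha * unnorm[True])   # ≈ 0.022, the same posterior as full enumeration
```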

General inference questions in Bayesian networks

• For queries in Bayesian networks, we divide variables into three classes:
– evidence variables: e = {e1, . . . , em} what you know
– query variables: x = {x1, . . . , xn} what you want to know
– non-evidence variables: y = {y1, . . . , yl} what you don’t care about

• The complete set of variables in the network is {e ∪ x ∪ y}.
• Inferences in Bayesian networks consist of computing p(x|e), the posterior probability of
  the query given the evidence:

  p(x|e) = p(x, e) / p(e) = α p(x, e) = α Σ_y p(x, e, y)

• This computes the marginal distribution p(x,e) by summing the joint over all values of y.
• Recall that the joint distribution is defined by the product of the conditional pdfs:

  p(z) = ∏_i P(zi | parents(Zi))

  where the product is taken over all variables in the network.

Another approach: Simplify model using clustering algorithms

• Inference is efficient if you have a polytree, i.e., a singly connected network.
• But what if you don’t?
• Idea: Convert a non-singly connected network to an equivalent singly connected

network.

(Sprinkler network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass.)

What should go into the nodes of the equivalent singly connected network?

Clustering or join tree algorithms

Original network: Cloudy → {Sprinkler, Rain} → Wet Grass

P(C) = 0.5

C  P(S|C)
T  0.1
F  0.5

C  P(R|C)
T  0.8
F  0.2

S  R  P(W|S,R)
T  T  0.99
T  F  0.9
F  T  0.9
F  F  0.01

Clustered network: Cloudy → Spr+Rain → Wet Grass

P(C) = 0.5

P(S+R = x | C)
C  TT    TF    FT    FF
T  0.08  0.02  0.72  0.18
F  0.1   0.4   0.1   0.4

S+R  P(W|S+R)
T+T  0.99
T+F  0.9
F+T  0.9
F+F  0.01

Idea: merge multiply connected nodes into a single, higher-dimensional node.

Constructing the clustered CPTs can take exponential time, but approximate algorithms usually give reasonable solutions.
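Where the merged table comes from: Sprinkler and Rain are conditionally independent given Cloudy, so each row of P(S+R | C) is just the product of the corresponding entries of P(S|C) and P(R|C). A small sketch (not from the slides):

```python
# Build the clustered CPT P(S+R | C) from P(S|C) and P(R|C).
P_S = {True: 0.1, False: 0.5}   # P(Sprinkler=True | Cloudy)
P_R = {True: 0.8, False: 0.2}   # P(Rain=True | Cloudy)

for c in (True, False):
    row = {}
    for s in (True, False):
        for r in (True, False):
            p_s = P_S[c] if s else 1 - P_S[c]
            p_r = P_R[c] if r else 1 - P_R[c]
            row[(s, r)] = round(p_s * p_r, 2)   # S and R are independent given C
    print("C =", c, row)
# C=True:  {TT: 0.08, TF: 0.02, FT: 0.72, FF: 0.18}
# C=False: {TT: 0.1,  TF: 0.4,  FT: 0.1,  FF: 0.4}
```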

So what do we do?

• There are special cases of Bayes nets for which there are fast, exact algorithms:
– variable elimination
– belief propagation

• There are also many approximations:
– stochastic (MCMC) approximations
– approximations

The complexity of multi-cause models

• In the models above, we specified the
joint conditional density by hand.

• This specified the probability of a
variable given each possible value of the
causal nodes.

• Can this be specified in a more generic
way?

• Can we avoid having to specify every
entry in the joint conditional pdf?

• For this we need to specify:
P(X | parents(X))

• The number of parameters (table
entries) scales exponentially with the
number of causes


Beyond tables: modeling causal relationships using Noisy-OR

• We assume each cause Cj can produce
effect Ei with probability fij.

• The noisy-OR model assumes the
parent causes of effect Ei contribute
independently.

• The probability that none of them
caused effect Ei is simply the product of
the probabilities that each one did not
cause Ei.

• The probability that any of them caused
Ei is just one minus the above, i.e.

P(Ei | par(Ei)) = P(Ei | C1, ..., Cn)
               = 1 − ∏_j (1 − P(Ei|Cj))
               = 1 − ∏_j (1 − fij)

(Example: the effect "catch cold" (C) has parent causes "hit by viral droplet" (D), "touch contaminated object" (O), and "eat contaminated food" (F), with noisy-OR parameters f_CD, f_CO, f_CF.)

P(C | D, O, F) = 1 − (1 − f_CD)(1 − f_CO)(1 − f_CF)

Another noisy-OR example


Table 2. Conditional probability table for P(Fever | Cold, Flu, Malaria), as calculated from the noisy-OR model.

Cold  Flu  Malaria  P(Fever)  P(¬Fever)
F     F    F        0.0       1.0
F     F    T        0.9       0.1
F     T    F        0.8       0.2
F     T    T        0.98      0.02 = 0.2 × 0.1
T     F    F        0.4       0.6
T     F    T        0.94      0.06 = 0.6 × 0.1
T     T    F        0.88      0.12 = 0.6 × 0.2
T     T    T        0.988     0.012 = 0.6 × 0.2 × 0.1

6.2. The noisy-OR model

Noisy-OR relationships generalize the logical notion of disjunctive causes. In propositional
logic, we might say Fever is true if and only if at least one of Cold, Flu, or Malaria is true.
The noisy-OR model adds some uncertainty to this strict logical description. The model
makes three assumptions. First, it assumes that each of the listed causes (Cold, Flu, and
Malaria) causes the effect (Fever), unless it is inhibited from doing so by some other cause.
Second, it assumes that whatever inhibits, say, Cold from causing a fever is independent
of whatever inhibits Flu from causing a fever. These inhibitors cause the uncertainty in
the model, giving rise to “noise parameters.” If P (Fever |Cold,¬Flu,¬Malaria) = 0.4,
P (Fever | ¬Cold,Flu,¬Malaria) = 0.8, and P (Fever | ¬Cold,¬Flu,Malaria) = 0.9, then
the noise parameters are 0.6, 0.2, and 0.1, respectively. Finally, it assumes that all the
possible causes are listed; that is, if none of the causes hold, the effect does not happen.
(This is not as strict as it seems, because we can always add a so-called leak node that covers
“miscellaneous causes.”)
These assumptions lead to a model where the causes—Cold, Flu, and Malaria—are the

parent nodes of the effect—Fever. The conditional probabilities are such that if no parent
node is true, then the effect node is false with 100% certainty. If exactly one parent is true,
then the effect is false with probability equal to the noise parameter for that node. In general,
the probability that the effect node is false is just the product of the noise parameters for
all the input nodes that are true. For the Fever variable in our example, we have the CPT
shown in Table 2.
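As a check, a short sketch (not from the paper) that rebuilds Table 2 from the three noise parameters quoted above, 0.6, 0.2, and 0.1:

```python
from itertools import product

# Noise parameters: the probability that Fever stays false when only that cause is present.
q = {'Cold': 0.6, 'Flu': 0.2, 'Malaria': 0.1}

for values in product((False, True), repeat=3):
    # P(not Fever) is the product of the noise parameters of the causes that are true.
    p_not = 1.0
    for (name, noise), on in zip(q.items(), values):
        if on:
            p_not *= noise
    print(values, round(1 - p_not, 3), round(p_not, 3))
# Reproduces Table 2, e.g. (True, True, True) -> 0.988, 0.012 = 0.6 * 0.2 * 0.1
```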
The conditional probabilities for a noisy-OR node can therefore be described using a

number of parameters linear in the number of parents rather than exponential in the number
of parents. Let qip be the noise parameter for the pth parent node of Xi—that is, qip is the
probability that Xi is false given that only its pth parent is true, and let Tik denote the set
of parent nodes of Xi that are set to T in the kth assignment of values for the parent nodes
Ui. Then

wijk = ∏_{p ∈ Tik} qip            if xij = F
wijk = 1 − ∏_{p ∈ Tik} qip        if xij = T        (6)

A more complex model with noisy-OR nodes

Figure 6. A network with noisy-OR nodes (battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, engine won't start). Each node with multiple parents has a CPT described by the noisy-OR model (Equation 6). The network has 22 parameters, compared to 59 for a network that uses explicitly tabulated conditional probabilities.

Figure 7. The output prediction error Ĥ(D0, Pw) as a function of the number of cases observed, for data generated from the network shown in Figure 6. The plot shows learning curves (average negative log likelihood per case versus number of training cases) for the APN algorithm using a noisy-OR representation and using explicit CPTs, compared against the target network.

A general one-layer causal network

• Could model either causes and effects or, equivalently, stochastic binary features.
• Each input xi encodes the probability that the ith binary input feature is present.
• The set of features represented by ϕj is defined by weights fij, which encode the
  probability that feature i is an instance of ϕj.

The data: a set of stochastic binary patterns

(Figure: a set of binary patterns; each column is a distinct eight-dimensional binary feature. There are five underlying causal feature patterns. What are they?)

(Figure: the same patterns shown alongside the true hidden causes of the data.)

This is a learning problem, which we’ll cover in a later lecture.

Hierarchical Statistical Models
A Bayesian belief network:

(Figure: a node Si with its parents pa(Si); the data D sit at the bottom of the hierarchy.)

The joint probability of binary states is

  P(S | W) = ∏_i P(Si | pa(Si), W)

The probability of Si depends only on its parents:

  P(Si | pa(Si), W) = h(Σ_j Sj wji)        if Si = 1
                    = 1 − h(Σ_j Sj wji)    if Si = 0

The function h specifies how causes are combined: h(u) = 1 − exp(−u), u > 0.

Main points:
• hierarchical structure allows the model to form high-order representations
• upper states are priors for lower states
• weights encode higher-order features
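A hedged sketch of ancestral sampling in such a network. The combination rule h(u) = 1 − exp(−u) is the one given above; the two-layer structure, the top-level "bias" inputs, and the weight values are made up purely for illustration.

```python
import math
import random

def h(u):
    """Combination rule from the slides: h(u) = 1 - exp(-u), u > 0."""
    return 1.0 - math.exp(-u)

random.seed(0)
bias = [0.5, 0.5]                 # illustrative inputs for the two top-level units
W = [[2.0, 0.0, 1.5],             # W[j][i]: weight from top unit j to lower unit i
     [0.0, 2.0, 1.5]]

# Sample top-level states first (they act as priors for the layer below),
# then each lower unit turns on with probability h(sum_j S_j * w_ji).
S_top = [1 if random.random() < h(b) else 0 for b in bias]
S_low = [1 if random.random() < h(sum(S_top[j] * W[j][i] for j in range(2))) else 0
         for i in range(3)]
print(S_top, S_low)
```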