CPS721: Bayesian Networks (BN)
Acknowledgement: several slides are based on
S.Thrun’s slides for CS221 (AI course at Stanford University), S.Russell and P.Norvig’s slides for Chapters 13 and 14 of their textbook “AI: Modern Approach”
F.Jensen and Th.Nielsen’s slides to Ch.2 of their book “Bayesian Networks and Decision Graphs”
Review of basic probability
Usually, we write a random variable starting with a capital, e.g., “Coin”. The probability of tail on a fair coin is P(Coin=tail) = 0.5, or shorter, P(tail) = 0.5
If you have a fair die, then each number {1, 2, 3, 4, 5, 6} has an equal probability of occurring. We say a roll of the die is a random variable D. Let P(D = i) be the probability that a roll comes up i, where i = 1, 2, 3, 4, 5, 6. Because the die is fair, P(D = i) = 1/6 for each i = 1, 2, 3, 4, 5, 6.
Q1. Suppose that a fair die is thrown once. What is the probability of getting an even number on a die ?
A1. All 6 numbers are equally likely, and there are 3 even numbers {2, 4, 6}. Since the events of throwing a number are mutually exclusive, the probability of getting an even number is P(D=2) + P(D=4) + P(D=6) = 3/6 = 0.5
Q2. Ann and Bob each roll a fair die. Ann wins if the product of the numbers is even. Bob wins if the product is odd. Suppose they play 100 games. How often is Ann likely to win? Explain.
A2. There are 3 favorable outcomes for Ann: both players roll even numbers; or she rolls an even number and Bob rolls an odd number; or Bob rolls an even number but she rolls an odd number. Since the total number of (odd/even) outcomes of each game is 4, and each such outcome is equally likely, Ann is likely to win 3/4 · 100 = 75% of the games.
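To sanity-check this answer, here is a small brute-force enumeration in Python (a sketch added for illustration; it is not part of the original slides):

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Ann wins when the product of the two numbers is even.
ann_wins = sum(1 for a, b in outcomes if (a * b) % 2 == 0)
print(ann_wins, "/", len(outcomes), "=", ann_wins / len(outcomes))  # 27 / 36 = 0.75
```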
Motivation: famous applications
Bayesian Networks (BN) have a number of practical applications.
(1) Microsoft Office assistant (in the 1990s): uses about 100 binary random variables. The probability table determined by 2^100 different outcomes is too big. It is impossible to keep it in memory.
(2) Munin BN for diagnosing neuro-muscular diseases has about 1000 random variables with complicated dependencies. Note that the joint probability distribution for 1000 binary random variables requires 2^1000 parameters to be fully specified !
(3) Junk email filters use BNs to filter out spam.
(4) Automated credit card fraud detection in the financial industry.
Basics of probability: continued
Q3. Suppose that 2 fair dice D1 and D2 are rolled simultaneously, but independently. What is the probability of getting even numbers on both?
A3. Since both dice are fair and independent, the probability of rolling even numbers on both is the product of probabilities of rolling an even number on each, i.e., 1/2 · 1/2 = 1/4
In other words, if we count the combinations of even numbers {2, 4, 6} on the first die and even numbers {2, 4, 6} on the second die, then we see there are 3 · 3 = 9 favorable combinations. In total, there are 6 · 6 = 36 possible outcomes on two dice. Therefore, the probability of getting even numbers on both dice is 9/36 = 1/4.
Q4. Suppose that 4 fair dice D1, D2, D3, D4 are rolled simultaneously and independently. What is the probability of getting odd numbers on all?
A4. Since all dice are fair and independent, the probability of rolling odd numbers on all is the product of probabilities of rolling an odd number on each, i.e., 1/2 · 1/2 · 1/2 · 1/2 = 1/2^4 = 1/16
In other words, the favourable outcomes on each are {1, 3, 5} and their probability is 1/2. Therefore, there are 3 · 3 · 3 · 3 = 81 favorable combinations, while the total number of outcomes is 6 · 6 · 6 · 6 = 36 · 36 = 1296. Since all outcomes are equally likely, the probability of odd numbers on all is 81/1296 = 1/16
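The same kind of check for Q4, comparing the independence shortcut (1/2)^4 with exhaustive enumeration (an illustrative sketch, not from the slides):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=4))              # 6^4 = 1296 outcomes
all_odd = sum(1 for roll in outcomes if all(d % 2 == 1 for d in roll))
print(all_odd, len(outcomes), all_odd / len(outcomes))       # 81 1296 0.0625
print((1 / 2) ** 4)                                          # 0.0625 = 1/16
```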
Independent Events
Multiplicative (Fundamental Counting) Principle for Independent Events. The probability of two independent events, A and B, occurring is
P(A and B) = P(A) · P(B)
Similar principle also applies to two independent random variables X and Y .
The probability that a random variable X takes a value x and an independent random variable Y takes a value y
P(X = x and Y = y) is P(X = x) · P(Y = y).
Example: Consider tossing independently n (possibly unfair) coins. The joint
probability that all of them give, say, heads, can be computed as
P(Coin1 = H, Coin2 = H, …, Coin_n = H) = P(Coin1 = H) · P(Coin2 = H) · … · P(Coin_n = H).
Note that instead of 2^n − 1 numbers we need only n numbers to define this joint probability distribution, a huge saving. But this is a very special case, since all n random variables are independent. It hints that we should be looking for clusters of variables that are independent of each other if we would like to compute joint probabilities efficiently.
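As a sketch of this saving: for n independent (possibly unfair) coins we store only the n per-coin head probabilities and multiply. The numbers below are made up for illustration.

```python
import math

# Hypothetical P(Coin_i = H) for n = 4 independent, possibly unfair coins.
p_heads = [0.5, 0.6, 0.7, 0.9]

# P(Coin1 = H, ..., Coin_n = H) = product of the individual probabilities.
print(math.prod(p_heads))  # ≈ 0.189
```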
Marginal probability: example
Let a random variable A have outcomes {a1, a2} and a random variable B have outcomes {b1, b2, b3}. All combinations (A = ai, B = bj) are mutually exclusive and the sum Σ_{i,j} P(A = ai, B = bj) = 1
We have P(A, B) and we need P(A).
       b1     b2     b3
a1    0.16   0.10   0.14   → 0.4
a2    0.24   0.28   0.08   → 0.6
B is marginalized out of P(A, B):
"A = a1" = ("A = a1" ∧ "B = b1") ∨ ("A = a1" ∧ "B = b2") ∨ ("A = a1" ∧ "B = b3"), so P(A = a1) = 0.16 + 0.10 + 0.14 = 0.4
"A = a2" = ("A = a2" ∧ "B = b1") ∨ ("A = a2" ∧ "B = b2") ∨ ("A = a2" ∧ "B = b3"), so P(A = a2) = 0.24 + 0.28 + 0.08 = 0.6
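A small sketch of this marginalization with the joint from the table stored as a Python dictionary (illustrative only):

```python
# Joint distribution P(A, B) from the table above.
P = {
    ('a1', 'b1'): 0.16, ('a1', 'b2'): 0.10, ('a1', 'b3'): 0.14,
    ('a2', 'b1'): 0.24, ('a2', 'b2'): 0.28, ('a2', 'b3'): 0.08,
}

# Marginalize B out: P(A = a) = sum over b of P(A = a, B = b).
P_A = {}
for (a, b), p in P.items():
    P_A[a] = P_A.get(a, 0.0) + p
print(P_A)  # a1 -> 0.4, a2 -> 0.6 (up to floating-point rounding)
```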
Example: Test for a cancer
Let a random variable Cancer (C) take values from the set {yes, no}. Let a random variable TestPositive (TP) take values from {yes, no}. Suppose that
P(C =yes) = 0.02 and P(C =no) = 0.98
Consider the joint probability distribution (j.p.d.) represented in a table P(C,TP):
Cancer   TestPositive   P(C,TP)
yes      yes            0.018
yes      no             0.002
no       yes            0.196
no       no             0.784
The numbers are collected from statistics, purchases, records, incidents, etc.
The marginal probability P(TP=yes) = P(TP=yes,C=yes)+P(TP=yes,C=no) = 0.018 + 0.196= 0.214
Marginalization in general
If P(X̄, Ȳ) is a j.p.d. over two sets of random variables X̄, Ȳ, and x̄, ȳ are assignments of values to X̄, Ȳ, then the marginal probability of Ȳ = ȳ is
P(Ȳ = ȳ) = Σ_x̄ P(X̄ = x̄, Ȳ = ȳ).
Potential computational problem: if we have n boolean random variables, then we need (2^n − 1) numbers to specify joint distributions (j.p.d.)
Conditional Probability
Informally, conditional probability P(A | B) is the probability of a random variable (r.v.) A in the context that a random variable B happens.
P(Die=2 | Die=even) = P(Die=2, Die=even) / P(Die=even) = (1/6) / (1/2) = 1/3.
For two random variables A, B, if the j.p.d. P(A, B) is directly given, then
P(A|B) = P(A, B) / P(B).
In a more complex case, if a j.p.d. P(X̄, Ȳ) includes many random variables,
P(x1, …, xm | y1, …, yn) = P(x1, …, xm, y1, …, yn) / P(y1, …, yn).
If m, n are large, then computing the right hand side is computationally challenging. How can we simplify this computation?
Hint: use independence between r.v.
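Using the same P(A, B) table as before, a conditional probability is simply a joint entry divided by a marginal; a short illustrative sketch:

```python
# Joint distribution P(A, B) from the earlier marginalization example.
P = {
    ('a1', 'b1'): 0.16, ('a1', 'b2'): 0.10, ('a1', 'b3'): 0.14,
    ('a2', 'b1'): 0.24, ('a2', 'b2'): 0.28, ('a2', 'b3'): 0.08,
}

# P(A = a1 | B = b1) = P(A = a1, B = b1) / P(B = b1)
p_b1 = sum(p for (a, b), p in P.items() if b == 'b1')   # 0.16 + 0.24 = 0.4
print(P[('a1', 'b1')] / p_b1)                           # 0.16 / 0.4 = 0.4
```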
Test for cancer
To describe the cancer test (TP) conditioned on C we consider P(TestPositive = yes|Cancer = yes) = 0.9
P(TestPositive = yes|Cancer = no) = 0.2 /* false positive */
Then we can calculate the probability of “false negative”
P(TP = no|C = yes) = 1 − P(TP = yes|C = yes) = 1 − 0.9 = 0.1 Similarly, we can calculate
P(TP = no|C = no) = 1 − P(TP = yes|C = no) = 1 − 0.2 = 0.8.
Note that using prior probabilities P(C = yes) = 0.02 and P(C = no) = 0.98 we can compute joint probabilities from conditional probabilities as
P(TP, C) = P(TP|C) ∗ P(C)
Joint probabilities are useful because using them we can answer diagnostic
questions: how likely is cancer given a positive test, i.e., P(C =yes|TP =yes) ? From the definition of conditional probability,
P(C=yes|TP=yes) = P(C=yes, TP=yes) / P(TP=yes) = 0.018 / 0.214 ≈ 0.084
where 0.214 was found as P(TP =yes, C =yes) + P(TP =yes, C =no)= 0.018 + 0.196 = 0.214 using marginalization from the j.p.d. for C and TP.
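The whole cancer example fits in a few lines: the model is the prior P(C) and the conditional P(TP|C); prediction and diagnosis both come from the joint. A sketch using the numbers from the slides:

```python
# Model: prior P(C) and conditional P(TP = yes | C), as given on the slides.
p_c = {'yes': 0.02, 'no': 0.98}
p_tp_yes_given_c = {'yes': 0.9, 'no': 0.2}

# Joint: P(TP = yes, C) = P(TP = yes | C) * P(C)  -> 0.018 and 0.196
joint_tp_yes = {c: p_tp_yes_given_c[c] * p_c[c] for c in p_c}

# Prediction: marginal P(TP = yes) = 0.018 + 0.196 = 0.214
p_tp_yes = sum(joint_tp_yes.values())

# Diagnosis: P(C = yes | TP = yes) = 0.018 / 0.214
print(joint_tp_yes['yes'] / p_tp_yes)  # ≈ 0.084
```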
Dental Example (Russell and Norvig, Ch.13)
Let the sample space for a r.v. Weather be {sunny, rain, cloudy, snow}.
A random variable Cavity (do I have a cavity) is boolean, i.e., has 2 possible values: {true, false}. A random variable Catch (a dental steel probe catches in a tooth), and a random variable Toothache are also boolean.
By the definition of conditional probability, P(Toothache,Catch,Cavity,Weather) = /* condition on Weather */
P(Toothache,Catch,Cavity|Weather)·P(Weather) = P(Toothache,Catch,Cavity)·P(Weather).
because events in a dental office do not depend on weather outside.
As a consequence, instead of 32-1=31 entries needed to define a joint probability distribution for random variables Weather,Cavity,Catch,Toothache together, we need only 12. Namely, we need 4 entries for Weather and 8 entries for boolean random variables. This is an important simplification thanks to independence.
Note we actually need only 7+3=10 numbers, since the remaining can be computed from the reciprocal law (1 − P_of_all_other_entries).
Note that P(Toothache, Catch, Cavity) has 2^3 − 1 = 7 independent parameters. Can we reduce this number ?
A Simple Bayes Network
The last example demonstrates a simple Bayes Network with two vertices and one directed edge.
Prior P(C) and conditional probability P(TP|C) are called the model. Calculating P(TP) in the forward direction from C is called prediction. Calculating P(C|TP) in the backward direction from TP is called diagnosis.
Note that if the arrow between C and TP were absent, then these r.v. would be understood as independent. If they are independent, then
P(C, TP) = P(C) · P(TP). What does this mean for the test?
==> Don’t take it, since it is irrelevant.
Dental Example: Cont.
Apply again the definition of conditional probability:
P(Toothache, Catch, Cavity) = P(Catch|Toothache, Cavity) ∗ P(Toothache, Cavity) How can we simplify this expression?
Notice that P(Catch|Toothache, Cavity) = P(Catch|Cavity), i.e., the probability that the dentist's tool is caught somewhere, given a cavity and a toothache, does not depend on whether I have a toothache or not, but depends only on the fact that I have a cavity.
Similarly, P(Toothache, Cavity) = P(Toothache|Cavity) ∗ P(Cavity).
Summarizing, we see that
P(Toothache, Catch, Cavity) =
P(Catch|Toothache, Cavity) ∗ P(Toothache, Cavity) = P(Catch|Cavity) ∗ P(Toothache|Cavity) ∗ P(Cavity).
The last distribution can be defined with only 2+2+1=5 numbers.
For example, P(Catch|Cavity) can be defined using 2 numbers only: P(Catch = y|Cavity = t) and P(Catch = y|Cavity = f), because the remaining probabilities can be calculated from the reciprocal law. Namely, P(Catch = n|Cavity = t) = 1 − P(Catch = y|Cavity = t) and
P(Catch = n|Cavity = f) = 1 − P(Catch = y|Cavity = f).
We need only one number to define P(Cavity =t), since P(Cavity =f) = 1 − P(Cavity =t)
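A sketch of this 5-number representation in code. Only the factorization and the parameter count come from the slides; the actual probability values below are made up for illustration.

```python
# Hypothetical parameters (5 numbers in total):
p_cavity = 0.2                                     # 1 number for P(Cavity = t)
p_catch_given_cavity = {True: 0.9, False: 0.2}     # 2 numbers for P(Catch = y | Cavity)
p_ache_given_cavity = {True: 0.6, False: 0.1}      # 2 numbers for P(Toothache = y | Cavity)

def joint(ache, catch, cavity):
    """P(Toothache, Catch, Cavity) = P(Catch|Cavity) * P(Toothache|Cavity) * P(Cavity)."""
    p_cav = p_cavity if cavity else 1 - p_cavity
    p_cat = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    p_ach = p_ache_given_cavity[cavity] if ache else 1 - p_ache_given_cavity[cavity]
    return p_cat * p_ach * p_cav

print(joint(True, True, True))   # 0.9 * 0.6 * 0.2 = 0.108
```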
Dental Bayesian Network
[BN diagram: Weather is an isolated node; Cavity has directed edges to Catch and to Toothache.]
Topology of this Bayesian network encodes conditional independence assertions. For example, the r.v. Weather and other r.v. are independent, since there are no edges between them. Directed edges represent cause-effect connections.
Thanks to conditional independence, the j.p.d. for these 4 r.v. can be defined using only 8 numbers that determine P(Weather), P(Catch|Cavity), P(Toothache|Cavity), and prior probability P(Cavity). This is significant savings in comparison to 31 numbers that would be needed to determine the j.p.d. without taking independence into account.
Now, we can generalize the lessons we learned from this example.
Bayesian networks
A Bayesian Network (BN) is a directed acyclic graph. Each node is a random variable with a finite set of values that are mutually exclusive and exhaustive. The links represent cause – effect relations.
(Of the example graphs shown on the slide, the 1st is OK, but the 2nd and 3rd are not OK due to cycles.)
For each r.v. A with parents B1 , . . . , Bn there is a conditional probability table
P(A|B1, . . . , Bn). Note: Nodes without parents receive a prior distribution.
The chain rule for a BN is more compact than the general chain rule for arbitrary r.v. A1, . . . , An, i.e., without independence encoded in a BN.
Consider a BN over random variables A1, . . . , An. Then, their j.p.d. is
P(A1, …, An) = ∏_i P(Ai | Parents(Ai)),
where Parents(Ai ) are the random variables in the nodes which are the parents of the node with a r.v. Ai .
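In code, the BN chain rule is one CPT lookup per node, multiplied together. A generic sketch (the data-structure choices here are illustrative, not prescribed by the slides):

```python
def joint_probability(assignment, parents, cpt):
    """P(A1, ..., An) = product over i of P(Ai = ai | Parents(Ai)).

    assignment: dict mapping each variable to its value
    parents:    dict mapping each variable to a tuple of its parent variables
    cpt:        dict mapping each variable to a table
                {(value, parent_values): probability}
    """
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][(value, parent_values)]
    return p
```

For the cancer network, for instance, parents would be {'C': (), 'TP': ('C',)} and the CPT entries would hold the prior P(C) and the conditional P(TP|C) from the earlier slides.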
Conditional Independence.
Definition.
Random variable X is conditionally independent of a random variable Y given Z, if P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all assignments of values x, y, z to random variables X, Y, Z.
To compute a j.p.d. efficiently we can repeatedly apply the Chain Rule P(X1, X2, …, Xn) = P(X1|X2, …, Xn) ∗ P(X2, …, Xn)
and keep expanding the RHS by introducing conditional probabilities. Why does this work?
We hope conditional independence will facilitate a reduction in dependences and make the resulting expression manageable.
In many practical cases, the use of conditional independence reduces the size of the representation of the joint probability distribution from exponential in the number of variables n to linear in n.
Summary: Conditional independence is our most basic and robust tool for answering queries about an uncertain environment with only limited evidence available.
Example: calculate j.p.d. P(A, B, C, D, E)
Use a bottom-up algorithm for expanding j.p.d. in terms of cond. probabilities.
P(A,B,C,D,E)
=P(E|A,B,C,D)·P(A,B,C,D)
= P(E|C) · P(A, B, C, D)
= P(E|C) · P(D|A, B, C) · P(A, B, C)
= P(E|C) · P(D|B, C) · P(A, B, C)
= P(E|C) · P(D|B, C) · P(C|A, B) · P(A, B)
= P(E|C) · P(D|B, C) · P(C|A) · P(B, A)
= P(E|C) · P(D|B, C) · P(C|A) · P(B|A) · P(A)
How many numbers do we need to define P(A, B, C, D, E) if all r.v. are binary?
We need 1 number to define P(A), 2 numbers to define P(B|A), 4 numbers to define P(D|B,C), 2 numbers to define P(C|A) and 2 more for P(E|C), i.e., 1+2+4+2+2=11 numbers in total vs. 2^5 − 1 = 31 without a BN structure.
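A quick sketch that counts the CPT parameters directly from the graph structure of this example (each binary node with k parents needs 2^k numbers); the dictionary encoding is just one possible choice:

```python
# Structure of the example BN: node -> tuple of its parents.
parents = {'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C'), 'E': ('C',)}

# A binary node with k parents needs 2**k independent numbers.
n_params = sum(2 ** len(ps) for ps in parents.values())
print(n_params)                 # 1 + 2 + 2 + 4 + 2 = 11
print(2 ** len(parents) - 1)    # 31 numbers without the BN structure
```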
Example: calculate diagnosis P(A = t|D = t, E = t)
Suppose we cannot observe A directly, but we have observed D and E. How can we calculate the diagnostic probability P(A=true|D=true,E=true) ? By the definition of conditional probability
P(A = t|D = t, E = t) = P(A = t, D = t, E = t) / P(D = t, E = t).
How can we compute the probabilities in the numerator and denominator? Use marginalization applied to the j.p.d. from the previous slide:
P(A=t|D=t,E=t) = Σ_{B,C} P(A = t, B, C, D = t, E = t) / Σ_{A,B,C} P(A, B, C, D = t, E = t)
The sum in the numerator has 4 terms, and the sum in the denominator has 8 terms if the random variables A, B, C, D, E are boolean, since each boolean r.v. takes 2 different values.
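A sketch of this diagnostic query for the same network. The structure and the marginalization formula come from the slides; the CPT numbers below are hypothetical, just to make the code runnable.

```python
from itertools import product

# Hypothetical CPTs for the binary network A -> B, A -> C, (B, C) -> D, C -> E.
pA = 0.3                                           # P(A = t)
pB = {True: 0.8, False: 0.1}                       # P(B = t | A)
pC = {True: 0.7, False: 0.2}                       # P(C = t | A)
pD = {(True, True): 0.9, (True, False): 0.6,
      (False, True): 0.5, (False, False): 0.05}    # P(D = t | B, C)
pE = {True: 0.75, False: 0.1}                      # P(E = t | C)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(value = t)."""
    return p_true if value else 1 - p_true

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(E|C) P(D|B,C) P(C|A) P(B|A) P(A), as on the previous slide."""
    return (bern(pE[c], e) * bern(pD[(b, c)], d) *
            bern(pC[a], c) * bern(pB[a], b) * bern(pA, a))

# Numerator: 4 terms (sum over B, C); denominator: 8 terms (sum over A, B, C).
num = sum(joint(True, b, c, True, True) for b, c in product([True, False], repeat=2))
den = sum(joint(a, b, c, True, True) for a, b, c in product([True, False], repeat=3))
print(num / den)   # P(A = t | D = t, E = t)
```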
Correct and Efficient Probabilistic Inference
Our ultimate goal is how to answer correctly and efficiently the probabilistic queries: given some Evidence, how can we compute a conditional probability for a Query random variable, i.e., P(Query|Evidence) ?
In a general case, both Query and Evidence can be tuples of random variables. Reasoning can go forward (prediction), or backwards (diagnosis).
Let us assume we have a joint probability distribution P(Query,Evidence,Other), where Other is a tuple of random variables defining the context of our inference problem.
To compute an unconditional probability of Evidence, we use marginalization: P(Evidence) = Σ_{Other,Query} P(Query, Evidence, Other).
As we know already from the definition of conditional probability,
P(Query|Evidence) = Σ_{Other} P(Query, Evidence, Other) / Σ_{Query,Other} P(Query, Evidence, Other).
Thus, to compute a conditional probability, we need to compute a joint probability efficiently. This is where BNs become very useful and practical.