Bayesian Networks
[AIMA 4G]
Chapter 13.1-13.3, 13.5
[13.2.3, 13.2.4, 13.4]
Artificial Intelligence
COSC1127/1125
Semester 2, 2021
Prof.
* Many slides are based on those of other authors and courses. Thanks to all of them!
Wominjeka!
Week 10
Bayesian Networks
Prof.
What are the chances that there was an earthquake given that the alarm went off and Sophia did not call?
Beds are burning (cover) – Midnight Oil 1987
20 years on from “Beds Are Burning”
“After Midnight Oil toured through the Outback in 1986, playing to remote Aboriginal communities and seeing first hand the seriousness of the issues in health and living standards, the band wrote “Beds Are Burning” to criticise how those populations were often forcibly removed from their lands, highlighted by the pre-chorus lines “it belongs to them, let’s give it back”.” – Wikipedia
Artificial Intelligence
8 Markov Decision Processes
9 Reinforcement Learning
10 Bayesian Reasoning
11 Recap: Prob+BN, MDP+RL
12 Practical Reasoning & Philosophical AI, AI @ RMIT
Some news…
Preliminary submission passed…
We are processing results and contributions.
Remember to certify each member.
THE has been marked, results soon
More in next slide
Bonus Project 3 available (individual)
Final Contest due Week 12
Agent system (Wiki+Video report later)
Wiki template to be provided soon.
CES Survey running now
Please fill it in if you are participating actively. 🙏 😃
THE feedback: “Tough but fair”
After dropping 3’s: 85% positive (4+5) vs 15% negative (1+2)
“Tough”: marks must be earned; higher levels in Bloom’s hierarchy.
“Fair”: within what was covered; lots of small bits; depth+breadth
Test had 4 “theoretical” questions/parts.
73 points = 66 core + 7 bonus
Partial marks given as deemed suitable for each question.
almost every question was awarded partial marks.
but not via bean counting!
Average was 52% (with max 100% and 14% HDs)
Comparable with 2020 (but they did it in 3hrs)
You will get an email with results per question.
Final mark (out of 100%) = score / 66
Final mark will be uploaded to Canvas
Thanks for your professionalism (& humor!)
THE summary (preliminary!)
Extra 1-on-1 support for Pacman Project
Want 1-on-1 team support for your project?
Book a 15-minute 1-on-1 team consultation on Wednesdays 3pm-4pm, weeks 9 to 12.
Book here!
When you book:
Make a PRIVATE post beforehand to let us know what you want to discuss.
Include the link to your repo.
Use the 1-on-1 tutorial consultation time!
Remember 1-on-1 Mondays 4pm-5pm Consultation Time for tutorials, use that & book it HERE!
Course Survey is live
Results have a big impact on:
Teacher: promotion, resources, course assignments, teaching awards, etc.
Prospective students who may take the course!
If you like the course, please do not just ignore it. This is where it counts!
I want to know your opinion too…
CES: The Two Questions
What would you tell a friend who is interested in the course?
Do you think you will be a better Computer Scientist after taking AI?
Questions?
Today…
IMPORTANT DISCLAIMER
This pack of slides is NOT core study
material. The core & main material is the book or any specific resources given for the week.
The slides in this presentation were prepared only to support the instructor during the lectorial, and only some of them have been used in the session as needed.
This pack of slides does not represent the complete material of the week, but only a few bits here and there around what was discussed in lectorials. Most of the week's content will NOT show up in this pack of slides.
Please refer to the book (and any resources specifically given) for a comprehensive coverage of the week material.
probabilities…
Recap from probabilities: Week 7
“When it is not in our power to determine what is true, we ought to follow what is most probable.” ― René Descartes
Sample space & events.
Probability functions.
Random variables.
Probability distributions.
Joint distributions.
Conditional probabilities.
Inference:
Enumeration/marginalization.
Bayes rule.
An Aside on Notation
P(X) : distribution over X.
P(X | Y): family of conditional distributions over X, one for each y ∊ Dom(Y).
Distinguish between:
P(X): a distribution (over the domain of variable X)
P(x) or P(¬x) (or P(xi) for non-boolean X): numbers.
Think of P(X|Y) as a function that accepts any xi and yk and returns P(xi | yk).
Conditional Probability
Definition:
P(A | B) = P(A Λ B) / P(B)
P(A Λ B) and P(B): by marginalization
Chain/Product rule:
P(A Λ B) = P(A | B) P(B)
Bayes Rule: update beliefs in H after evidence e:
P(H | e) = P(e | H) P(H) / P(e)
Know these!
Chain rule
P(A Λ B Λ C) = P(A, B, C) = P(A)*P(B|A)*P(C|A,B)
We know that:
P(B|A) = P(AΛB) / P(A)
P(C|A,B) = P(AΛBΛC) / P(AΛB)
So, substituting back:
P(A Λ B Λ C) = P(A,B,C)
= P(A) * (P(A Λ B) / P(A)) * (P(A Λ B Λ C) / P(A Λ B))
= P(A Λ B Λ C)   ✓ (the intermediate terms cancel, confirming the identity)
Bayes Rule: CP + Chain Rule
P(A|B) = P(AΛB) / P(B) [ Def Cnd Prob]
P(AΛB) = P(B|A)*P(A) [ Chain/Prod Rule]
So: P(A|B) = P(B|A)*P(A) / P(B)
A = the cause or hypothesis (illness)
B = the effect or evidence (symptom)
Deriving Bayes Rule: another way!
P(AΛB) = P(A|B)*P(B) [ Chain/Prod Rule]
P(AΛB) = P(B|A)*P(A) [ Chain/Prod Rule]
So:
P(B|A)*P(A) = P(A|B)*P(B)
Hence:
P(A|B) = P(B|A)*P(A) / P(B)
Deriving Bayes Rule
P(Cause | Effect) =
P(Effect|Cause)*P(Cause) / P(Effect)
We can generally get P(Effect|Cause) & P(Cause) easily via statistical analysis of examples
But what about P(Effect)? IMPLICIT!
P(E) = P(E|C)*P(C) + P(E|¬C)*P(¬C)
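As a tiny illustration, here is a minimal Python sketch (mine, not course code) of Bayes rule with P(Effect) computed implicitly via total probability; the function name and the numbers in the usage line are purely illustrative.

```python
# Minimal sketch: Bayes rule with P(Effect) obtained via total probability.
def posterior(p_e_given_c, p_e_given_not_c, p_c):
    """Return P(Cause | Effect) from P(Effect|Cause), P(Effect|~Cause), P(Cause)."""
    p_e = p_e_given_c * p_c + p_e_given_not_c * (1 - p_c)   # the implicit P(Effect)
    return p_e_given_c * p_c / p_e                          # Bayes rule

# Illustrative numbers only: a 90%-sensitive test, 20% false positives, 1% prior.
print(posterior(0.9, 0.2, 0.01))   # ~0.043: the cause is still unlikely!
```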
Independence
Generally we want to compute query variables Y given evidence variables E = e
There may be hidden (i.e. unobserved) vars H
H = U – Y – E (where U is the set of all variables)
If we had the joint probability distribution then we could marginalize out the hidden variables:
P(Y | E=e) = P(Y Λ E=e) / P(E = e)
= Σ_h P(Y Λ E=e Λ H=h) /
Σ_{y,h} P(Y=y Λ E=e Λ H=h)
Computing Conditional Probabilities
Inference: Computational Bottleneck
Semantically/conceptually, the picture is clear; but several issues must be addressed…
Issue 1
How do we specify the full joint distribution over a set of random variables X1, X2,…, Xn ?
e.g., what is the value of P(X1 Λ ¬X2 Λ … Λ ¬Xn-1 Λ Xn)?
Exponential number of possible worlds.
e.g., if the Xi are boolean, then 2^n numbers (or 2^n − 1 parameters/degrees of freedom, since they sum to 1)
These numbers are not robust/stable.
These numbers are not natural to assess (what is probability that “Sebastian wants a cup of tea; it’s not raining or snowing in Melbourne; robot charge level is low; …”?).
Issue 2
Inference in this representation is frightfully slow.
To calculate P(α) we must either:
Sum over an exponential number of worlds (enumeration/marginalization); or
Sum over each possible evidence value e to determine
P(α | E=e).
The Dentistry Example
Toothache: reported toothache.
Cavity: cavity is present.
Catch: feeling a “catch” with a dental probe.
Sunny: weather is sunny.
Calculating Distributions
              toothache            ¬toothache
            catch    ¬catch      catch    ¬catch
cavity      0.108    0.012       0.072    0.008
¬cavity     0.016    0.064       0.144    0.576
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
P(cavity Λ toothache) = P(cavity Λ toothache Λ catch) + P(cavity Λ toothache Λ ¬catch) = 0.108 + 0.012 = 0.12
Issue 1: how to get this big table?
Issue 2: lots of sums…
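To make the two issues concrete, here is a rough Python sketch (not course code) that stores the full table above and answers any query by brute-force marginalization; the helper names `prob` and `conditional` are mine.

```python
# A rough sketch: the dentistry joint table, queried by summing out variables.
VARS = ("Cavity", "Toothache", "Catch")
table = {   # (Cavity, Toothache, Catch) -> probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(assignment):
    """P of a partial assignment, summing out the unmentioned variables."""
    return sum(p for values, p in table.items()
               if all(dict(zip(VARS, values))[v] == val
                      for v, val in assignment.items()))

def conditional(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob({**query, **evidence}) / prob(evidence)

print(prob({"Cavity": True}))                              # ~0.2
print(prob({"Cavity": True, "Toothache": True}))           # ~0.12
print(conditional({"Cavity": True}, {"Toothache": True}))  # ~0.6
```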
Is there anything we can do?
How do we avoid these two problems?
no solution in general;
but in practice there is structure we can exploit.
We’ll use: 💪
independence;
conditional independence;
Bayesian networks.
Independence
A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A,B) = P(A)P(B)
[Figure: the four-variable network splits into two independent pieces — {Toothache, Catch, Cavity} (with Cavity the parent of Toothache and Catch) and {Weather} on its own]
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
P(Toothache, Catch, Cavity | Weather) = P(Toothache, Catch, Cavity)
intuitively, learning B doesn’t influence beliefs about A
Conditional Independence
Independence is very strong and hence rare…
Conditional independence is more common
Consider P(catch | toothache, cavity):
it depends on whether there is a cavity,
but not on whether there is a toothache:
P(catch | toothache, cavity) = P(catch| cavity )
P(catch | toothache, ¬ cavity) = P(catch| ¬ cavity)
P(Catch | Toothache, Cavity) = P(Catch| Cavity )
Conditional Independence
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Similarly:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) =
P(Toothache | Cavity) P(Catch | Cavity)
Catch is conditionally independent of Toothache given Cavity
Conditional Independence
x and y are conditionally independent given z iff:
P(x | y, z) = P(x | z) OR
P(y | x, z) = P(y | z) OR
P(x, y | z) = P(x | z) P(y | z)
learning y doesn’t influence your beliefs about x if you already know z
Example: learning someone's mark on the AI exam can influence the probability you assign to a specific GPA; but if you already knew their final AI grade, learning the exam mark would not influence your GPA assessment.
Conditional Independence
P(Toothache, Catch, Cavity) =
P(Toothache | Catch, Cavity) P(Catch, Cavity) =
P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
Because of conditional independence this becomes
P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
Reduces the size of the joint distribution.
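A quick numeric check (mine): the dentistry table from earlier does satisfy this conditional independence.

```python
# Check: P(toothache, catch | cavity) = P(toothache | cavity) * P(catch | cavity)
p_cavity       = 0.108 + 0.012 + 0.072 + 0.008    # P(cavity) = 0.2
p_tc_given_cav = 0.108 / p_cavity                  # = 0.54
p_t_given_cav  = (0.108 + 0.012) / p_cavity        # = 0.6
p_c_given_cav  = (0.108 + 0.072) / p_cavity        # = 0.9

print(abs(p_tc_given_cav - p_t_given_cav * p_c_given_cav) < 1e-9)   # True
```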
What good is independence?
Suppose (say, boolean) variables X1, X2,…, Xn are mutually independent.
We can specify the full joint distribution using only n parameters (linear) instead of 2^n − 1 (exponential)!
How? Simply specify P(x1), …, P(xn)
From this we can recover the probability of any world or any (conjunctive) query easily
Recall P(x,y)=P(x)P(y) and P(x|y)=P(x) and P(y|x)=P(y)
Example
4 independent boolean random variables X1, X2, X3, X4
P(x1) = 0.4, P(x2) = 0.2, P(x3) = 0.5, P(x4) = 0.8
P(x1, ¬x2, x3, x4) = P(x1)(1-P(x2))P(x3)P(x4)
= (0.4)(0.8)(0.5)(0.8)
= 0.128
P(x1, x2, x3 | x4) = P(x1,x2,x3)
= P(x1)P(x2)P(x3)
= (0.4)(0.2)(0.5)
= 0.04
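The same numbers, checked in a couple of lines of Python (my own sketch, assuming full mutual independence):

```python
# A quick check of the example above.
p = {"x1": 0.4, "x2": 0.2, "x3": 0.5, "x4": 0.8}

print(p["x1"] * (1 - p["x2"]) * p["x3"] * p["x4"])   # P(x1,~x2,x3,x4) ~ 0.128
print(p["x1"] * p["x2"] * p["x3"])                   # P(x1,x2,x3 | x4) ~ 0.04
```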
The Value of Independence
Complete independence reduces both representation of the joint and inference from O(2^n) to O(n)!!
Unfortunately, such complete mutual independence is very rare in most realistic domains.
Fortunately, most domains do exhibit a fair amount of conditional independence. We can exploit conditional independence for representation and inference as well.
Bayesian networks do just this!
Method of teaching:
Lectorials + Forum
Assessments:
Pacman Projects
Flexible THE
Technology:
EdStem + GitHub
Learning Materials:
Videos + book + tute
If you are an active participant in the course (lectorials, tutes), please, please don't just ignore it, so I get a balanced view…
Exploiting Conditional Independence
[Network: E → C → G → L → S]
E – Sebastian woke up too early
C – Sebastian needs coffee
G – Sebastian is grumpy
L – The lecture did not go smoothly
S – Students are sad
If Sebastian woke up too early E, Sebastian probably needs coffee C; if Sebastian needs coffee, he’s likely grumpy G. If he is grumpy then it’s possible that the lecture won’t go smoothly L. If the lecture does not go smoothly then the students will likely be sad S.
Conditional Independence
[Network: E → C → G → L → S]
If you learned any of E, C, G, or L, would your assessment of P(S) change?
If any of these are seen to be true, you would increase P(s) and decrease P(¬s).
So S is not independent of E, or C, or G, or L.
Conditional Independence
[Network: E → C → G → L → S]
If you knew the value of L (true or false), would learning the value of E, C, or G influence P(S)?
Influence that these factors have on S is mediated by their influence on L.
Students aren’t sad because Sebastian was grumpy, they are sad because of the lecture.
So S is independent of E, C, and G, given L
Conditional Independence
[Network: E → C → G → L → S]
So S is independent of E, and C, and G, given L
Similarly:
S is independent of E, and C, given G.
G is independent of E, given C.
This means that:
P(S | L, {G,C,E} ) = P(S|L)
P(L | G, {C,E} ) = P(L| G)
P(G| C, {E} ) = P(G | C)
P(C | E) and P(E) don't "simplify".
Conditional Independence
[Network: E → C → G → L → S]
By the chain rule (for any instantiation of S…E):
P(S,L,G,C,E) =
P(S|L,G,C,E) P(L|G,C,E) P(G|C,E) P(C|E) P(E)
By our independence assumptions:
P(S,L,G,C,E) = P(S|L) P(L|G) P(G|C) P(C|E) P(E)
We can specify the full joint by specifying just five local conditional distributions:
P(S|L); P(L|G); P(G|C); P(C|E); and P(E)
Inference is Easy
[Network: E → C → G → L → S]
Want to know P(g)?
P(c) = P(c|e)P(e) + P(c|¬e)P(¬e)
= 0.9 * 0.7 + 0.5 * 0.3 = 0.78
P(¬c) = P(¬c|e)P(e) + P(¬c|¬e)P(¬e) = 0.22
P(¬c) = 1 – P(c), as well
P(g) = P(g|c)P(c) + P(g|¬c)P(¬c)
= 0.7 * 0.78 + 0.0 * 0.22 = 0.546
P(¬g) = 1 – P(g) = 0.454
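A small Python sketch (mine) reproducing the calculation above, using only the CPT entries quoted on this slide: P(e)=0.7, P(c|e)=0.9, P(c|¬e)=0.5, P(g|c)=0.7, P(g|¬c)=0.0.

```python
# The chain E -> C -> G lets us compute P(c) first, then P(g).
p_e = 0.7
p_c_given_e = {True: 0.9, False: 0.5}    # P(c | E)
p_g_given_c = {True: 0.7, False: 0.0}    # P(g | C)

p_c = p_c_given_e[True] * p_e + p_c_given_e[False] * (1 - p_e)   # ~0.78
p_g = p_g_given_c[True] * p_c + p_g_given_c[False] * (1 - p_c)   # ~0.546

print(p_c, 1 - p_c)   # ~0.78  ~0.22
print(p_g, 1 - p_g)   # ~0.546 ~0.454
```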
Bayesian Networks
Bayesian Networks
This structure is a Bayesian network.
Graphical representation of the direct dependencies over a set of variables + a set of conditional probability tables (CPTs) quantifying the strength of those influences.
Bayes nets generalize the above ideas in very interesting ways, leading to effective means of representation and inference under uncertainty.
Bayesian Networks
Graphical notation for conditional independence statements.
aka belief networks, probabilistic networks
Compact specification of joint distributions.
One node per variable.
Directed acyclic graph.
Link means ‘directly influences’.
Node conditional on its parents: P(Xi | parents(Xi) )
'Dentist' Network
[Network: Cavity → Toothache, Cavity → Catch; Weather is a separate, unconnected node]
Weather is independent of other variables.
Toothache and Catch are conditionally independent given Cavity.
Bayesian Networks
[Figure: a small BN with nodes A, B, C where A and B are parents of C; the CPTs shown are P(A), P(B), and P(C | A, B), the latter with one row for each of the four combinations of A and B]
BN over variables {X1, X2,…, Xn} consists of:
a DAG whose nodes are the variables.
a set of CPTs, P(Xi | Parents(Xi)), one for each Xi
Semantics of a Bayes Net
The structure of the BN means: every Xi is conditionally independent of all of its non-descendants given its parents:
P(Xi | S ∪ Par(Xi)) = P(Xi | Par(Xi))
for any subset S ⊆ NonDescendants(Xi)
Burglary Network
`My neighbour John calls me to say the house alarm has gone off. My other neighbour Mary hasn’t rung. Minor earthquakes (!) could set off the alarm. Is there a burglar?’
Causal links:
Earthquake can set off alarm
Burglar can set off the alarm
Alarm can cause John to call
Alarm can cause Mary to call
Burglary Network
Query: P(B = true | A = true, M = false, J = false)?
… or, what is the same: P(b | a, ¬m, ¬j)?
Semantics of Bayes Nets
If we ask for P(x1, x2,…, xn) we obtain ∏_i P(xi | Par(xi)),
assuming an ordering consistent with the network (parents before children)
By the chain rule, we have:
P(x1, x2,…, xn)
= P(xn | xn-1, … , x1) P(xn-1 | xn-2, …, x1) … P(x1)
= P(xn | Par(xn)) P(xn-1 | Par(xn-1)) … P(x1)
Thus, the joint is recoverable using the parameters (CPTs) specified in an arbitrary BN
Check my video…
Burglary Network
P(j, m, a, ¬e, ¬b)
= P(j | a) P(m | a) P(a | ¬e,¬b) P(¬e) P(¬b)
= 0.90 x 0.70 x 0.001 x 0.998 x 0.999
= 0.0006281113
Full joint distribution can be calculated as follows:
P(X1, …, Xn) = P(X1) P(X2 | X1) … P(Xn|X1,…Xn-1)
= ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
= ∏_{i=1}^{n} P(Xi | Parents(Xi))
This makes the calculation a lot simpler …
Inference by enumeration
What is the probability of Burglary given that John and Mary call?
Formally: P(B | j, m)?
Calculate (by marginalizing/enumerating the hidden variables a, e):
P(B | j, m) = α Σ_a Σ_e P(B, j, m, a, e) ←— use the Bayes Net factorization to simplify!
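Here is a rough Python sketch of this enumeration (not the course's reference code). The slides quote P(b)=0.001, P(e)=0.002, P(a|¬b,¬e)=0.001, P(j|a)=0.90 and P(m|a)=0.70; the remaining CPT entries below are the textbook's values for this example.

```python
# Inference by enumeration for the burglary network (sketch).
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,      # P(a | B=b, E=e)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                      # P(j | A=a)
P_M = {True: 0.70, False: 0.01}                      # P(m | A=a)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the local CPTs."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(B | j, m): sum out the hidden variables A and E, then normalize (alpha).
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product([True, False], repeat=2))
          for b in (True, False)}
alpha = 1 / sum(unnorm.values())
print(alpha * unnorm[True])   # ~0.284: a burglary is still fairly unlikely
```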
Burglary inference via pgmpy
Check Piazza post @285 from Benhao: https://piazza.com/class/kbsmlzxg3k7418?cid=285
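For comparison, a rough pgmpy sketch of the same network (see the Piazza post above for the course's version). It assumes a recent pgmpy (older releases call the model class BayesianModel); state 0 = False, state 1 = True, and the CPT entries not quoted in these slides are again the textbook's.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("B", "A"), ("E", "A"), ("A", "J"), ("A", "M")])

cpd_b = TabularCPD("B", 2, [[0.999], [0.001]])                    # P(B)
cpd_e = TabularCPD("E", 2, [[0.998], [0.002]])                    # P(E)
# Columns follow pgmpy's parent-state ordering: (B=0,E=0), (B=0,E=1), (B=1,E=0), (B=1,E=1)
cpd_a = TabularCPD("A", 2,
                   [[0.999, 0.71, 0.06, 0.05],                    # P(A=0 | B, E)
                    [0.001, 0.29, 0.94, 0.95]],                   # P(A=1 | B, E)
                   evidence=["B", "E"], evidence_card=[2, 2])
cpd_j = TabularCPD("J", 2, [[0.95, 0.10], [0.05, 0.90]],          # P(J | A)
                   evidence=["A"], evidence_card=[2])
cpd_m = TabularCPD("M", 2, [[0.99, 0.30], [0.01, 0.70]],          # P(M | A)
                   evidence=["A"], evidence_card=[2])

model.add_cpds(cpd_b, cpd_e, cpd_a, cpd_j, cpd_m)
assert model.check_model()

print(VariableElimination(model).query(["B"], evidence={"J": 1, "M": 1}))
# P(B=1 | j, m) should come out around 0.28
```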
Designing Bayesian Networks
Designing Bayesian Networks
Choose a set of variables that describe the domain
Choose an ordering for the variables (!!)
While there are variables left:
Choose a variable Xi and add a node for it to the network
Set Parent(Xi) to a minimal set of nodes such that P(Xi | X1, …, Xi-1) = P(Xi | Parent(Xi) )
(Variable can only have parents that come earlier in the ordering)
Define the conditional probability table for Xi
The trick is to choose the ordering so that Parent(Xi) is much smaller than {X1, …, Xi-1} …
Causal Intuitions
The construction of a BN is simple:
works with arbitrary orderings of variable set
but some orderings are much better than others!
if ordering/dependence structure reflects causal intuitions, a more natural, compact BN results
In this BN, we've used the ordering Mal, Flu, Cold, Aches to build the BN for the distribution P over these variables.
Variable can only have parents that come earlier in the ordering
Causal Intuitions
Suppose we build the BN for distribution P using the opposite ordering: Aches, Cold, Flu, Malaria
resulting network is more complicated!
Mal depends on Aches; but it also depends on Cold, Flu given Aches
Cold, Flu explain away Mal given Aches.
Flu depends on Aches; but also on Cold given Aches
Cold depends on Aches
Compactness
Causal ordering (Mal, Flu, Cold, Aches): 1 + 1 + 1 + 8 = 11 numbers
Reverse ordering (Aches, Cold, Flu, Mal): 1 + 2 + 4 + 8 = 15 numbers
If each random variable is directly influenced by at most k others, then each CPT has at most 2^k entries; thus the entire network of n (boolean) variables is specified by at most n·2^k numbers.
Contrast that with the full joint probability table: 2^n numbers. E.g., for n = 30 and k = 5: 30 × 32 = 960 numbers vs. 2^30 ≈ a billion.
Testing Independence (optional)
Given a BN, how do we determine whether two variables X, Y are independent given evidence E?
we use a (simple) graphical property!
X and Y are conditionally independent given evidence E if E d-separates X and Y
D-separation: A set of variables E d-separates X and Y if E “blocks” every undirected path in the BN between X and Y.
Blocking: Graphical View
Conclusions
Probabilities allow us to reason under uncertainty.
Full probability distribution tables are just not feasible.
Exploit Conditional Independence.
Bayesian networks are a representation of such independencies.
reduces complexity!
Whiteboard used in lectorials