COMP3308/3608, Lecture 11
ARTIFICIAL INTELLIGENCE
Probabilistic Reasoning. Bayesian Networks and Inference.
Reference: Russell and Norvig, ch. 13 and ch. 14; Witten, Frank, Hall and Pal, ch. 9, pp. 336-352
Connect-4 Tournament for COMP3608
• 1st place: and (equally first)
• 2nd place: Kenzie
• 3rd place:
Congratulations!
Probabilistic reasoning
• Probability basics
• Inference using full joint distribution
• Conditional independence
• Product rule, chain rule, theorem of total probability
• Bayes Theorem and its use for inference
Bayesian networks
• Representation and assumptions
• Exact inference by enumeration
• Exact inference by variable elimination
• Relation between Naïve Bayes and Bayesian networks
• Learning CPT from data
Probabilistic Reasoning
Probability Basics – Example
Let’s start with an example:
• 4 Boolean variables: Headache, Fever, Vomiting and Meningitis
• Headaches, fever and vomiting are common symptoms of the disease meningitis
• The probabilities on the following slides are computed from a small example dataset of 10 patient records
Example from Kelleher et al. 2015, MIT Press
Probability Basics (1)
• We have 4 random variables: Headache, Fever, Vomiting and Meningitis
• Domain of a random variable – the values it can take
• Our variables are binary, so their domain is {true, false}
• Domain of a coin flip: {head, tail}
• Domain of a die roll: {1, 2, 3, 4, 5, 6}
• Event (proposition) – an assignment of a set of variables with values, e.g. Headache=true ∧ Vomiting=false
Probability Basics (2)
• Probability P(A=a) is the fraction of times A takes the value a
• P(Fever=true)=4/10, P(Headache=false)=3/10
• P(head)=P(tail)=0.5 for a fair coin
Probability Basics (3)
• Joint probability of 2 variables P(A=a, B=b) is the probability of A=a and B=b occurring together, i.e. of both variables taking these specific values
• P(Meningitis=true, Headache=true)= 2/10
• Note that P(A=a, B=b) is a shorthand for P(A=a ∧ B=b)
• Joint probability can be similarly defined for more than 2 variables
Probability Basics (4)
• Conditional probability P(A=a| B=b) is the probability of A=a given that we already know that B=b
• P(Meningitis=true| Headache=true)= 2/7
• Probability distribution of a variable A is a data structure that contains the probability for each possible value of A
• P(Meningitis) = <0.3, 0.7> (the values sum to 1; the 1st value is for true, the 2nd for false)
Probability Basics (5)
• Joint probability distribution of 2 variables A and B is a data structure (a table) that shows the probability for all combinations of values of A and B
• P(Vomiting, Meningitis):
(table with one probability entry for each of the four combinations of values of Vomiting and Meningitis)
Probability Basics (6)
The full joint probability distribution (i.e. of all variables) for our example is:
(table of P(H,F,V,M) with one probability for each of the 2^4 = 16 combinations of values)
All numbers must sum to 1 as this is the full joint probability table. Given N variables, each with k values, there are k^N entries in the table (this is a lot!)
Axioms and Theorems of Probability
1) All probabilities are between 0 and 1: P(A) ∈ [0,1]
2) Valid propositions have a probability of 1 and invalid propositions have a probability of 0, i.e. P(true)=1 and P(false)=0
3) Probability of a disjunction: P(A ∨ B) = P(A) + P(B) – P(A ∧ B) (a small numeric check is given below)
Some theorems:
For an event A: P(~A) = 1 - P(A)
For a random variable X with k different values x1, …, xk: P(X=x1) + … + P(X=xk) = 1
=> the probability distribution of a single variable must sum to 1
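As a quick illustration of axiom 3, here is a small numeric check in Python using the die-roll domain mentioned earlier; the two events chosen are illustrative, not from the slides:

# Checking P(A or B) = P(A) + P(B) - P(A and B) on a fair die roll.
# A = "roll <= 2", B = "roll is even" (illustrative events).
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / len(outcomes)                                            # each outcome has probability 1/6

p_a = sum(p for x in outcomes if x <= 2)                         # 2/6
p_b = sum(p for x in outcomes if x % 2 == 0)                     # 3/6
p_a_and_b = sum(p for x in outcomes if x <= 2 and x % 2 == 0)    # only outcome 2 -> 1/6
p_a_or_b = sum(p for x in outcomes if x <= 2 or x % 2 == 0)      # outcomes 1, 2, 4, 6 -> 4/6

print(abs(p_a_or_b - (p_a + p_b - p_a_and_b)) < 1e-12)           # True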
Inference Using the Full Joint Distribution
Given the full joint probability table, we can compute the probability of any event in the domain by summing over the cells (atomic events) where the event is true
Example: 3 Boolean variables: Toothache, Cavity and Catch
• (What is Catch? Catch=true – the dentist’s steel probe catches in our tooth)
P(toothache)=?
This is a shorthand for P(Toothache=true)
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
This is called marginalization, or summing out, because the variables other than Toothache were summed out (their values were added)
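A minimal Python sketch of marginalization over a full joint table, assuming the joint distribution is stored as a dictionary; the probabilities are the standard values of this dental example from Russell and Norvig (only some of them are quoted on the slides):

# Full joint distribution P(Toothache, Catch, Cavity) as a dictionary:
# key = (toothache, catch, cavity), value = probability
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

# P(Toothache = true): sum over every atomic event where toothache is true
p_toothache = sum(p for (toothache, _, _), p in joint.items() if toothache)
print(round(p_toothache, 10))   # 0.108 + 0.012 + 0.016 + 0.064 = 0.2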
Inference Using the Full Joint Distribution (2)
• P(cavity ∨ toothache) = ?
• This is a shorthand for P(Cavity=true ∨ Toothache=true)
• P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Conditional Probability, Product and Chain Rules
1) Definition of conditional probability: P(a|b) = P(a,b) / P(b), if P(b) ≠ 0
NB: the comma means ∧, so this is the same: P(a|b) = P(a ∧ b) / P(b), if P(b) ≠ 0
2) The product rule (follows from 1):
P(a,b) = P(a | b)P(b)
P(a,b) = P(b,a) = P(b | a)P(a)   <- it works the other way too (commutative property); a small numeric check follows the chain rule examples below
3) Chain rule – derived by successive application of the product rule:
The decomposition holds for any order of the variables
P(x1, …, xn) = ∏_{i=1..n} P(xi | xi-1, …, x1), e.g.
P(a,b) = P(a | b)P(b)
P(a,b,c) = P(a | b,c)P(b | c)P(c)
P(a,b,c,d) = P(a | b,c,d)P(b | c,d)P(c | d)P(d)
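A quick numeric check of the definition of conditional probability and the product rule, using the dental-example values that appear later in this lecture:

# P(toothache) = 0.2 and P(cavity, toothache) = 0.108 + 0.012 = 0.12 (quoted on later slides)
p_toothache = 0.2
p_cavity_and_toothache = 0.12

# Definition of conditional probability: P(a|b) = P(a,b) / P(b)
p_cavity_given_toothache = p_cavity_and_toothache / p_toothache
print(round(p_cavity_given_toothache, 10))                 # 0.6

# Product rule: P(a,b) = P(a|b) P(b) recovers the joint probability
print(round(p_cavity_given_toothache * p_toothache, 10))   # 0.12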
Let’s do Some Reasoning Using Conditional Probability!
P(~cavity| toothache) = ?
This is a shorthand for P(Cavity=false| Toothache=true)
P(~cavity | toothache) = P(~cavity, toothache) / P(toothache) =
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.08 / 0.2 = 0.4
Let’s do Some Reasoning Using Conditional Probability! (2)
And let’s calculate the opposite: P(cavity| toothache) = ? This is a shorthand for P(Cavity=true| Toothache=true)
P(cavity | toothache) = P(cavity, toothache) / P(toothache) =
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.12 / 0.2 = 0.6
Good - 0.6 and 0.4 sum up to 1 as expected! => P(Cavity| toothache) = <0.6, 0.4>
Normalization
The denominator is the same and can be viewed as a normalization constant α:
P(cavity | toothache) = α P(cavity, toothache) = α (0.108 + 0.012) = α · 0.12
P(~cavity | toothache) = α P(~cavity, toothache) = α (0.016 + 0.064) = α · 0.08
The probability distribution with α:
P(Cavity | toothache) = α <0.12, 0.08>
Computing the normalization constant: α = 1/(0.12 + 0.08) = 1/0.2 = 5
Final probability distribution without α:
P(Cavity | toothache) = <0.12/0.2, 0.08/0.2> = <0.6, 0.4>
Normalization – More Generally:
Another way to write this using the P notation:
P(Cavity | toothache) = α P(Cavity, toothache) =
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ~catch)] =
= α [<0.108, 0.016> + <0.012, 0.064>] =
= α <0.12, 0.08> = <0.12/(0.12 + 0.08), 0.08/(0.12 + 0.08)> = <0.6, 0.4>
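A tiny Python sketch of the normalization step, using the unnormalized values from the slide:

# Unnormalized values of P(Cavity, toothache) for Cavity = true / false
unnormalized = {True: 0.12, False: 0.08}

alpha = 1 / sum(unnormalized.values())                        # alpha = 1 / 0.2 = 5
posterior = {v: round(alpha * p, 10) for v, p in unnormalized.items()}
print(posterior)                                              # {True: 0.6, False: 0.4}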
Inference by Enumeration Using Full Joint Distribution
General rule:
• X – query variable (e.g. Cavity)
• E – evidence (e.g. Toothache)
• e – the observed values for E (e.g. true)
• H – the remaining (hidden) variables, i.e. all variables other than X and E
The summation is over the values of the hidden variables, i.e. we are summing out the hidden variables
P(X | E=e) = α P(X, E=e) = α Σ_h P(X, E=e, H=h)

Example:
P(Cavity | Toothache=true) = α P(Cavity, Toothache=true) =
= α Σ_{h ∈ {true, false}} P(Cavity, Toothache=true, Catch=h) =
= α [P(Cavity, Toothache=true, Catch=true) + P(Cavity, Toothache=true, Catch=false)]
Pseudocode
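A minimal Python sketch of the enumeration procedure, assuming the full joint table is stored as a dictionary keyed by tuples of variable values (function and variable names here are illustrative, not from the textbook):

def enumeration_ask(X, evidence, variables, joint):
    """Return the distribution P(X | evidence) by summing out the hidden variables
    of a full joint table stored as {tuple of values: probability}."""
    dist = {}
    for x in (True, False):                          # each value of the query variable
        total = 0.0
        for values, prob in joint.items():
            row = dict(zip(variables, values))
            if row[X] == x and all(row[v] == val for v, val in evidence.items()):
                total += prob                        # sum over consistent atomic events
        dist[x] = total
    alpha = 1 / sum(dist.values())                   # normalization constant
    return {x: round(alpha * p, 10) for x, p in dist.items()}

# Usage on the dental example: P(Cavity | Toothache = true) = <0.6, 0.4>
variables = ("Toothache", "Catch", "Cavity")
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}
print(enumeration_ask("Cavity", {"Toothache": True}, variables, joint))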
• It doesn’t scale well:
• For n Boolean variables it requires a joint probability table of size O(2^n) and O(2^n) time to process it
• Impractical for real problems: hundreds of random variables
Independence
• 2 events A and B are independent if: P(A,B) = P(A) * P(B) (see the numeric check below)
• This condition is equivalent to: P(A|B) = P(A) and P(B|A) = P(B). Can you show this?
• Independence is domain knowledge!
• It makes the inference easier
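A small check of absolute independence on a hypothetical joint distribution (the numbers below are made up for illustration, not from the lecture):

# Hypothetical joint distribution over two Boolean variables A and B
p_joint = {(True, True): 0.12, (True, False): 0.28,
           (False, True): 0.18, (False, False): 0.42}

p_a = sum(p for (a, _), p in p_joint.items() if a)      # P(A=true) = 0.4
p_b = sum(p for (_, b), p in p_joint.items() if b)      # P(B=true) = 0.3

# A and B are independent iff P(A,B) = P(A) * P(B) for every combination of values
independent = all(
    abs(p_joint[(a, b)] - (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)) < 1e-9
    for a in (True, False) for b in (True, False)
)
print(independent)   # True for these numbers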
Independence – Example
product rule: P(A,B) = P(A|B)P(B)
Suppose we add one more binary variable to the dental example: Weather, with values cloudy and sunny. Task – compute:
P(Weather = cloudy, toothache, catch, cavity) =
= P(Weather = cloudy | toothache, catch, cavity) P(toothache, catch, cavity)    (product rule)
However, a person’s dental problems do not influence the weather
=> P(Weather = cloudy | toothache, catch, cavity) = P(Weather = cloudy)    (independence: P(A|B) = P(A))
=> P(Weather = cloudy, toothache, catch, cavity) = P(Weather = cloudy) P(toothache, catch, cavity)
Instead of storing a table with 2^4 entries, we can store 2^3 + 2^1 entries. This also reduces the complexity of the inference problem.
Conditional Independence
• Absolute independence is a very useful property but it is rare in practice
• Although random variables are rarely absolutely independent, they are often conditionally independent
• Example:
If I have a cavity, the probability that the probe catches in my tooth doesn’t depend on whether I have a toothache (it depends on the skill of the dentist):
P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I do not have a cavity:
P(catch | toothache, ~cavity) = P(catch | ~cavity)
In other words, Catch is conditionally independent of
Toothache given Cavity:
P(Catch | Toothache,Cavity) = P(Catch | Cavity)
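A numeric check of this conditional independence, using the joint probabilities quoted earlier in the lecture:

# Joint probabilities with Cavity = true (quoted on earlier slides)
p_t_c   = 0.108   # P(toothache, catch, cavity)
p_t_nc  = 0.012   # P(toothache, ~catch, cavity)
p_nt_c  = 0.072   # P(~toothache, catch, cavity)
p_nt_nc = 0.008   # P(~toothache, ~catch, cavity)

p_catch_given_toothache_cavity = p_t_c / (p_t_c + p_t_nc)                       # 0.9
p_catch_given_cavity = (p_t_c + p_nt_c) / (p_t_c + p_t_nc + p_nt_c + p_nt_nc)   # 0.9

# Equal values => Catch is conditionally independent of Toothache given Cavity
print(round(p_catch_given_toothache_cavity, 6), round(p_catch_given_cavity, 6))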
Conditional Independence (2)
• Similarly, based on domain knowledge we can assert that Toothache is conditionally independent of Catch given Cavity:
P(Toothache | Catch,Cavity) = P(Toothache | Cavity)
• Let’s see how conditional independence simplifies inference:
P(Toothache,Catch,Cavity) =
= P(Toothache | Catch,Cavity) P(Catch | Cavity) P(Cavity)    (chain rule)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)    (conditional independence; see the sketch below)
• So we need 2^2 + 2^2 + 2^1 = 10 probability numbers, or 5 independent numbers (as counterparts sum to 1), which is less than without conditional independence
• In most cases, conditional independence reduces the size of the representation of the joint distribution from O(2^n) to O(n)
• n is the number of variables that are conditionally independent, given another variable
• => Conditional independence assertions: 1) allow probabilistic systems to scale up and 2) are more commonly available than absolute independence assertions
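A sketch of the factored representation this gives us: three small tables instead of one joint table. The conditional probabilities below are chosen to be consistent with the dental-example numbers used earlier (e.g. 0.6 · 0.9 · 0.2 = 0.108), but the full CPTs are an assumption for illustration:

# Factored representation:
# P(Toothache, Catch, Cavity) = P(Toothache | Cavity) * P(Catch | Cavity) * P(Cavity)
p_cavity = 0.2
p_toothache_given_cavity = {True: 0.6, False: 0.1}   # P(toothache = true | Cavity)
p_catch_given_cavity     = {True: 0.9, False: 0.2}   # P(catch = true | Cavity)

def joint(toothache, catch, cavity):
    pc = p_cavity if cavity else 1 - p_cavity
    pt = p_toothache_given_cavity[cavity] if toothache else 1 - p_toothache_given_cavity[cavity]
    pk = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    return pt * pk * pc

print(round(joint(True, True, True), 6))   # 0.6 * 0.9 * 0.2 = 0.108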
Bayes Theorem
• We know it already!
• But this time we will use it for reasoning not classification
• We can see where it comes from – from the product rule:
P(a,b) = P(a | b)P(b)
P(a,b) = P(b,a) = P(b | a)P(a) commutativity
• Equating the 2 right-hand sides: P(a | b)P(b) = P(b | a)P(a)
=> P(a|b) = P(b|a) P(a) / P(b)    <- Bayes Theorem
P(A | B) = P(B | A) P(A) / P(B)    <- More general form for multivalued variables using the P notation
P(A | B) = α P(B | A) P(A)    <- General form with the normalization constant α
Using the Bayes Theorem for Reasoning - Example
Recall: P(A | B) = P(B | A) P(A) / P(B)
Setup: a test for a rare disease is 99% accurate, and the disease occurs in 1 in 10,000 people. A patient has tested positive.
Q1. What is the probability that the patient has the disease?
Q2. Why is the rarity of the disease good news given that the patient has tested positive for it?
Example from Kelleher et al. 2015, MIT Press
Answering Q1
Q1. What is the probability that the patient has the disease?
Variables: D – hasDisease {true, false}, T – positiveTest {true, false}
P(d | t) = ?    This is a shorthand for P(D=true | T=true) = ?
P(d | t) = P(t | d) P(d) / P(t) = 0.99 * 0.0001 / P(t)
To calculate P(t) we will use the Theorem of Total Probability
Theorem (Rule) of Total Probability
• The unconditional probability for any event X is:
P(X) = Σ_i P(X | Yi) P(Yi)
where Y is another event and Yi are all possible outcomes for Y
• This is actually the summing out rule that we used before!
P(t) = P(t | d) P(d) + P(t | ~d) P(~d) = 0.99 * 0.0001 + 0.01 * 0.9999 ≈ 0.0101
P(d | t) = P(t | d) P(d) / P(t) = 0.99 * 0.0001 / 0.0101 ≈ 0.0098
=> The probability of having the disease, given that the test is positive, is about 0.98%. This is very low: <1%.
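The same calculation as a short Python sketch:

p_d = 0.0001             # prior P(disease)
p_t_given_d = 0.99       # P(positive test | disease)
p_t_given_not_d = 0.01   # P(positive test | no disease)

# Theorem of total probability, then Bayes theorem
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)
p_d_given_t = p_t_given_d * p_d / p_t

print(round(p_t, 4), round(p_d_given_t, 4))   # 0.0101 0.0098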
Answering Q2
• Q2. Why is the rarity of the disease good news given that the patient has tested positive for it?
• Because P(d|t) is proportional to P(d), so a lower P(d) means a lower P(d|t)
P(d | t) = P(t | d) P(d) / P(t) = 0.99 * 0.0001 / 0.0101 ≈ 0.0098
• In other words, the Bayes theorem explicitly includes the prior probability of an event when calculating the likelihood of that event based on evidence
• Hence, for our example, when the disease is much rarer than the test’s error rate, a positive test result does not mean the disease is likely
Summary
• Probability is a rigorous formalism for representing uncertain knowledge
• Given the full joint probability distribution, we can compute the probability of any event in the domain by summing over the cells (atomic events) where the event is true
• However, the full joint distribution is usually too large to store and use for inference
• Independence and conditional independence allow probabilistic systems to scale up
• We also saw how to use several rules and theorems for probabilistic reasoning, e.g. the definition of conditional probability, the product rule, the chain rule, the Bayes theorem and the theorem of total probability
Bayesian Networks
Bayesian Networks - Motivation
• To do probabilistic reasoning, we need to know the joint probability distribution of the variables in the domain
• But if we have N binary variables, we need 2^N numbers to specify the joint probability distribution
• As we saw, usually there are some variables that are independent (fully independent or conditionally independent), so we don’t need all of these 2^N probability values
• We would like to explicitly represent and exploit this independence
• Bayesian networks allow us to do this
Bayesian Networks
• Bayesian networks are graph-based models that encode structural relationships between variables such as direct influence and conditional independence
• Hence, they provide a more compact representation than a full joint probability distribution
• Bayesian networks are directed graphs without cycles (DAG – Directed Acyclic Graphs) composed of 3 elements:
• Nodes – there is 1 node for each variable
• Links – each link denotes dependence between 2 nodes: “directly influences”
• Conditional Probability Tables (CPT) for each node, given its parents
Bayesian Network – Example 1
• 2 binary variables: A and B
• Meaning of the link: A directly influences B (= the values of A directly influence the values of B)
• Terminology: A is a parent node of B, B is a child node of A
• A has no parents, so CPT only shows the unconditional probability distribution
• Each row in a CPT sums to 1 => for a binary variable X we can simply state the probability for true, P(X=T), as P(X=F) = 1 - P(X=T)
• => we don’t need the second CPT column (simplified form)
• The simplified form (simplified CPT) is typically used
Example from Kelleher et al. 2015, MIT Press
Bayesian Network – Example 2
• I am at work, neighbour John calls to say my alarm is ringing, but neighbour Mary doesn’t call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
• Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
• The network topology reflects causal knowledge:
• A burglar can set the alarm off
• An earthquake can set the alarm off
• The alarm can cause Mary to call
• The alarm can cause John to call
Example 2 with CPTs
• For the burglary example we are also given these CPTs:
Compactness
• Consider a domain with n Boolean variables
• We have created a Bayesian network
• The CPT for a variable X with k parents will have 2^k rows
• 1 row for each combination of parent values, e.g. Alarm has 2 parents and will have 4 rows
• Each row has 1 number p for X=true; the number for X=false is 1-p
• If each variable has at most k parents, the complete Bayesian network requires O(n · 2^k) numbers, i.e. it grows linearly with n
• For comparison, the full joint distribution table requires O(2^n) numbers, which is much bigger – it grows exponentially with n
• For our burglary example: 1+1+4+2+2=10 numbers in the CPTs of the Bayesian net vs 2^5=32 in the full joint distribution table
Joint Probability Rule
• Given a Bayesian network with n nodes, the probability of an event x1, x2, …, xn can be computed as:
P(x1, …, xn) = ∏_{i=1..n} P(xi | Parents(Xi))
• The full joint distribution is the product of the local conditional distributions
• Example for the burglar net:
P(j, m, a, ~b, ~e) =
= P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e) = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.00063
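A short Python sketch of this computation, assuming the standard CPT values of the Russell and Norvig burglary example for the entries not quoted on the slide:

# CPTs of the burglary network (standard textbook values)
p_b = 0.001                                   # P(Burglary = true)
p_e = 0.002                                   # P(Earthquake = true)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm = true | Burglary, Earthquake)
p_j = {True: 0.90, False: 0.05}               # P(JohnCalls = true | Alarm)
p_m = {True: 0.70, False: 0.01}               # P(MaryCalls = true | Alarm)

# P(j, m, a, ~b, ~e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
p = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(round(p, 6))   # ~0.000628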