
COMP30024 Artificial Intelligence
Week 11 Problem Sheet Solutions
• Tutors: These are not unique, complete (and definitely not optimally pedagogical) solutions. There are other ways of deriving these results, and the way you prefer is to some extent an aesthetic judgement.
• Students: If you want to attain any benefit from the problems you should make a genuine attempt at them before you see the solutions.


Comments/corrections are as always greatly appreciated by the teaching team.
(Based on RN 14.15). Leo is a botanist who lives in the Bay Area. His neighbourhood is a
hotspot for burglars, so his house is fitted with an alarm system. Unfortunately, the alarm
is not perfectly reliable: it doesn’t always trigger during a home invasion, and it may be
erroneously triggered during minor earthquakes, which occur occasionally. Leo has asked his
neighbours John and Mary to call him if they hear the alarm. This setup can be represented as the following Bayesian network.
Figure 13.2: A typical Bayesian network, showing both the topology and the conditional probability tables (CPTs). In the CPTs, the letters B, E, A, J, and M stand for Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls, respectively.
[Figure omitted. The CPT values used in the working below are: P(B=true) = .001; P(E=true) = .002; P(A=true|B,E) = .95 (b,e), .94 (b,¬e), .29 (¬b,e), .001 (¬b,¬e); P(J=true|A) = .90 (a), .05 (¬a); P(M=true|A) = .70 (a), .01 (¬a).]

1. Explaining Away [T] Consider the above Bayesian network.
(a) How many parameters (i.e. probability values) are required to model this network? If we add another boolean-valued parent to the alarm variable, how many parameters are needed to model the alarm node? 10 and 8 respectively: 1 each for Burglary and Earthquake, 2^2 = 4 for Alarm, and 2 each for JohnCalls and MaryCalls give 10; with a third boolean parent, the Alarm node alone needs 2^3 = 8 entries.
(b) For boolean Xi, how many parameters are required to model the joint distribution p(X1, . . . , Xn)? Is there a unique factorization of the joint distribution?
From the chain rule, factorizing sequentially:
p(X1, . . . , Xn) = p(X1)p(X2|X1)p(X3|X1, X2) . . . p(Xn|X1, X2, . . . , Xn−1)
This requires O(2^n) parameters in the binary case and O(k^n) parameters in the k-ary case. Without additional structure there is no canonical factorization; we can apply the chain rule in any variable ordering we like.
(c) Now place a Bayesian network structure on the joint p(X1, . . . , Xn). Derive an expression, in terms of the maximum number of parents for any RV in the Bayes net, P* = max_i |Parents(X_i)|, for the number of parameters required to model the joint.
For a general Bayesian network with k-ary variables, we require O(n × k^{P*}) parameters, where P* = max_i |Parents(X_i)|. Witness the raison d'être of Bayesian networks: the space required to represent the joint falls from exponential in n, the number of RVs, to linear in n.
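To make the counting concrete, here is a minimal sketch (not part of the original solution) that computes exact CPT sizes. The `parents` dictionary encoding the burglary network and the counting convention (one free probability per parent configuration for boolean nodes, matching the "10, 8" answer in part (a)) are assumptions for illustration.

```python
# Minimal sketch: exact CPT parameter counts for a discrete Bayes net.
# The `parents` dict records the number of parents of each node in the
# burglary network above; k = 2 for boolean variables.

def cpt_params(n_parents: int, k: int = 2) -> int:
    """Free parameters in one CPT: (k - 1) probabilities for each of the
    k**n_parents parent configurations."""
    return (k - 1) * k ** n_parents

parents = {"B": 0, "E": 0, "A": 2, "J": 1, "M": 1}
total = sum(cpt_params(p) for p in parents.values())
print(total)           # 1 + 1 + 4 + 2 + 2 = 10
print(cpt_params(3))   # Alarm with a third boolean parent: 2**3 = 8
```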
(d) If no evidence is observed, are the Burglary and Earthquake nodes independent? Marginalizing over unobserved variables in the joint distribution and simplifying the summation, we have
p(B, E) = \sum_A \sum_J \sum_M p(A, B, E, J, M)    (1)
        = \sum_A \sum_J \sum_M p(B) p(E) p(A|B, E) p(J|A) p(M|A)    (2)
        = p(B) p(E) \sum_A p(A|B, E) \sum_J p(J|A) \sum_M p(M|A)    (3)
Note that every sum above is equal to 1 by the law of total probability – i.e. the variables A, J, M are irrelevant to the query. More generally, every variable that is not an ancestor of a query/evidence variable is irrelevant. So we have p(B, E) = p(B)p(E) in this case.
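If you want to verify this numerically, a minimal brute-force sketch (not part of the original solution) is below. It assumes the CPT values from Figure 13.2; the helper names `bern` and `joint` are just for illustration.

```python
import itertools

# CPTs from Figure 13.2 (each entry is the probability of True given the parents).
pB, pE = 0.001, 0.002
pA = {(True, True): 0.95, (True, False): 0.94,
      (False, True): 0.29, (False, False): 0.001}
pJ = {True: 0.90, False: 0.05}
pM = {True: 0.70, False: 0.01}

def bern(p, x):
    """p(X = x) for a boolean X with P(X = True) = p."""
    return p if x else 1 - p

def joint(b, e, a, j, m):
    """Full joint, factorized according to the network topology."""
    return (bern(pB, b) * bern(pE, e) * bern(pA[(b, e)], a)
            * bern(pJ[a], j) * bern(pM[a], m))

# Marginalize out A, J, M and compare against the product of the marginals.
for b, e in itertools.product([True, False], repeat=2):
    marginal = sum(joint(b, e, a, j, m)
                   for a, j, m in itertools.product([True, False], repeat=3))
    print(b, e, round(marginal, 10), round(bern(pB, b) * bern(pE, e), 10))
```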
(e) Assume we observe Alarm = True; now are Burglary and Earthquake independent? Justify your answer by calculating whether the probabilities involved satisfy the definition of conditional independence.
By Bayes’ Theorem
p(B|E, a) = \frac{1}{Z} p(B, E, a)    (4)
          = \frac{1}{Z} p(a|B, E) p(B) p(E)    (5)
          = \frac{p(a|B, E) p(B)}{\sum_B p(a|B, E) p(B)}    (6)

where Z = \sum_B p(a|B, E) p(B) p(E) is a normalizing constant (confusingly, also called the 'evidence'). For B and E to be conditionally independent given A = a, p(B|E, a) should not depend on E. This is not possible, as p(a|B, E) depends on E (validate this numerically). There is an easier way to determine whether RVs in a Bayes net are conditionally independent given a set of conditioning variables, using the notion of d-separation. Here B → A ← E with A observed, hence the path connecting B and E is active, meaning B \not\perp E | A.
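As the numerical check suggested above, the following sketch (not part of the original solution) evaluates p(b | E, a) for both values of E using the CPT values from Figure 13.2; the function name is just for illustration.

```python
# Sketch: numerically check that p(b | E, a) depends on E (explaining away).
pB = 0.001
pA = {(True, True): 0.95, (True, False): 0.94,
      (False, True): 0.29, (False, False): 0.001}

def p_b_given_e_a(e: bool) -> float:
    # p(b | E = e, a) = p(a|b,e) p(b) / sum_B p(a|B,e) p(B)
    num = pA[(True, e)] * pB
    den = pA[(True, e)] * pB + pA[(False, e)] * (1 - pB)
    return num / den

print(p_b_given_e_a(True))    # ~0.0033: an earthquake explains the alarm away
print(p_b_given_e_a(False))   # ~0.48:  no earthquake, so burglary is far more likely
```

Since the two values differ, p(B | E, a) depends on E, confirming B and E are not conditionally independent given A.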
(f) (Discussion) Explain intuitively why knowledge of the Alarm random variable renders the random variables Earthquake and Burglary conditionally dependent.
Under our assumptions, two possible explanations for the alarm going off (Alarm=True) are a minor earthquake occurring (Earthquake=True) and a burglary occurring (Burglary=True). If we know that Alarm=True and Earthquake=True, then intuitively we think it is unlikely that a burglary has also occurred (Burglary=False): Leo and the burglar would have to be very unlucky to have an earthquake and a burglary occur at the same time. Formally:
p(b | a, e) < p(¬b | a, e)    (7)

where we denote an RV being true by replacing the capitalized RV with lowercase, e.g. p(B = True | A, E) = p(b | A, E). This is reasonably intuitive: for observed data with two possible explanations, knowing that one explanation has occurred is sufficient to account for the observation and decreases the probability that the other possible explanation has also occurred. There is now a conditional dependence between causes, given observation of the children. Hence for any PGM with the structure X → Z ← Y, knowledge of Z couples X and Y.

2. Burglary or Earthquake? [T] When Leo gets to work in the morning, he sees that John tried to call him, but left no message. There has also been a call from Mary. Assume John and Mary did not confer before calling. Using the Bayesian network above, derive an expression for the probability of a burglary having occurred given the above information (though possible, there is no need to compute the exact numerical value here).

Solution: Firstly we factorize the joint distribution according to the topology of the Bayesian network:

p(A, B, E, J, M) = p(B) p(E) p(A|B, E) p(J|A) p(M|A)    (8)

Our query variable is B, and our evidence variables are J = j, M = m (John calls and Mary calls, respectively). The hidden or unobserved variables are E and A (earthquake and alarm occurring, respectively). Using Bayes' Theorem, we can express the posterior distribution over the query B as proportional to the full joint distribution, then exploit the factorization and marginalize over the hidden variables to obtain:

p(B | j, m) = \frac{1}{Z} \sum_A \sum_E p(A, B, E, J = j, M = m)    (9)
            = \frac{1}{Z} \sum_A \sum_E p(B) p(E) p(A|B, E) p(j|A) p(m|A)    (10)
            = \frac{1}{Z} p(B) \sum_E p(E) \sum_A p(A|B, E) p(j|A) p(m|A)    (11)

It is enough to stop here. However, for interest we evaluate this conditional. In the last line we push the summations as far right as possible using the distributive property to avoid excessive computation. Performing full enumeration, we would proceed left to right, looping over all possible values of E, and for each possible value of E looping over all possible values of A. This should remind you of nested for loops. Instead we proceed in reverse (right to left) order, caching intermediate results to avoid repeated calculations.

Start with the rightmost sum over A, and use the convention that B runs over the columns and E runs over the rows. Note that p(A|B, E) is a 2 × 2 × 2 tensor, and the sum is over 2 × 2 'slices' of the whole tensor:

\sum_A p(A|B, E) p(j|A) p(m|A) = p(a|B, E) p(j|a) p(m|a) + p(¬a|B, E) p(j|¬a) p(m|¬a)
  = \begin{pmatrix} p(a|b, e) & p(a|¬b, e) \\ p(a|b, ¬e) & p(a|¬b, ¬e) \end{pmatrix} p(j|a) p(m|a) + \begin{pmatrix} p(¬a|b, e) & p(¬a|¬b, e) \\ p(¬a|b, ¬e) & p(¬a|¬b, ¬e) \end{pmatrix} p(j|¬a) p(m|¬a)
  = \begin{pmatrix} 0.95 & 0.29 \\ 0.94 & 0.001 \end{pmatrix} \times 0.9 \times 0.7 + \begin{pmatrix} 0.05 & 0.71 \\ 0.06 & 0.999 \end{pmatrix} \times 0.05 \times 0.01
  = \begin{pmatrix} p(j, m|b, e) & p(j, m|¬b, e) \\ p(j, m|b, ¬e) & p(j, m|¬b, ¬e) \end{pmatrix}
  = \begin{pmatrix} 0.598525 & 0.183055 \\ 0.592230 & 0.0011295 \end{pmatrix}

So we have marginalized, or 'eliminated', the hidden variable A from our intermediate results. Now evaluate the sum over E. Note that the top row corresponds to E = e and the bottom to E = ¬e, so this is a vector-matrix multiplication:

\sum_E p(E) \begin{pmatrix} p(j, m|b, E) & p(j, m|¬b, E) \end{pmatrix}
  = \begin{pmatrix} p(e) & p(¬e) \end{pmatrix} \begin{pmatrix} p(j, m|b, e) & p(j, m|¬b, e) \\ p(j, m|b, ¬e) & p(j, m|¬b, ¬e) \end{pmatrix}
  = \begin{pmatrix} 0.002 & 0.998 \end{pmatrix} \begin{pmatrix} 0.598525 & 0.183055 \\ 0.592230 & 0.0011295 \end{pmatrix}
  = \begin{pmatrix} p(j, m|b) & p(j, m|¬b) \end{pmatrix}
  ≈ \begin{pmatrix} 0.592243 & 0.001493 \end{pmatrix}

Now we have marginalized/eliminated the hidden variable E from our intermediate result, leaving a factor that depends only on B. Finally, take the Schur (elementwise) product with the prior p(B) (the left entry corresponds to B = b and the right to B = ¬b):

p(B | j, m) = \frac{1}{Z} p(B) ⊙ \begin{pmatrix} p(j, m|b) & p(j, m|¬b) \end{pmatrix}
            = \frac{1}{Z} \begin{pmatrix} p(b) p(j, m|b) & p(¬b) p(j, m|¬b) \end{pmatrix}
            = \frac{1}{Z} \begin{pmatrix} 0.00059224 & 0.00149186 \end{pmatrix}

Normalization implies that Z = \sum_B p(B, j, m) = 0.002084100239, giving the posterior distribution

p(B | j, m) = \begin{pmatrix} 0.28417184 & 0.71582816 \end{pmatrix}    (12)

You may have found the above sequence of matrix products quite chaotic. It becomes manageable if you keep track of which variable is being eliminated at each step; this lets you establish a clear correspondence between the axes of your matrix/tensor and the variables to be eliminated. For a more efficient approach, you may use the 'pointwise' product method outlined in RN Ch. 14.4.2 or the elimination algorithm presented in lecture.
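For interest, here is a short numpy sketch (not part of the original solution) that reproduces the elimination above. The array layout follows the same convention (E over rows, B over columns), and the CPT values are those from Figure 13.2.

```python
import numpy as np

# CPT values from Figure 13.2. Column order is [b, ¬b], row order is [e, ¬e],
# matching the convention used in the working above.
pB = np.array([0.001, 0.999])                     # p(B)
pE = np.array([0.002, 0.998])                     # p(E)
pA = np.array([[0.95, 0.29],                      # p(a | B, E), rows: e, ¬e
               [0.94, 0.001]])
pJa, pJna = 0.90, 0.05                            # p(j | a), p(j | ¬a)
pMa, pMna = 0.70, 0.01                            # p(m | a), p(m | ¬a)

# Eliminate A: f1(B, E) = sum_A p(A|B,E) p(j|A) p(m|A)
f1 = pA * pJa * pMa + (1 - pA) * pJna * pMna      # shape (2, 2)

# Eliminate E: f2(B) = sum_E p(E) f1(B, E)  (vector-matrix product)
f2 = pE @ f1                                      # shape (2,), i.e. p(j, m | B)

# Multiply by the prior and normalize.
unnorm = pB * f2
print(unnorm)                   # ≈ [0.00059224, 0.00149186]
print(unnorm / unnorm.sum())    # ≈ [0.2842, 0.7158]
```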
3. Inference along a Chain [T] Suppose a network has the form of a chain: a sequence of Boolean variables X1, . . . , Xn where Parents(Xi) = {Xi−1} for i = 2, . . . , n.

X1 → X2 → X3 → · · · → Xn    (13)

First derive an expression for the probability p(X1 | Xn = True).

p(X1 | Xn = T) ∝ \sum_{X_2} \sum_{X_3} \cdots \sum_{X_{n-1}} p(X_1, . . . , X_n = T)    (14)

What is the complexity of computing the inference p(X1 | Xn = True) using:

(a) Enumeration?
Here the query is X1 and the evidence is Xn = True. The hidden variables are then {X2, . . . , Xn−1}. We will fix Xn and eliminate the hidden variables in the order (Xn−1, Xn−2, . . . , X2). Combining Bayes' Theorem and the Bayesian network structure together with marginalization, the posterior is evaluated as:

p(X1 | Xn = T) ∝ \sum_{X_2} \sum_{X_3} \cdots \sum_{X_{n-1}} p(X_1, . . . , X_n = T)
              = \sum_{X_2} \sum_{X_3} \cdots \sum_{X_{n-1}} p(X_1) \left[ \prod_{i=2}^{n-1} p(X_i | X_{i-1}) \right] p(X_n = T | X_{n-1})    (15)
              = p(X_1) \sum_{X_2} p(X_2 | X_1) \cdots \sum_{X_{n-1}} p(X_{n-1} | X_{n-2}) p(X_n = T | X_{n-1})    (16)

Naive enumeration proceeds right to left: there are n − 2 nested sums over Boolean variables, evaluated separately for each value of X1 = T/F, and each Xi can only take on 2 values, so the total complexity is O(2^{n−1}) (we naively recompute the nested sum for each value of X1). For discrete variables with at most k values in the domain, the runtime is O(k^{n−1}), as there are k terms in each sum.

(b) Variable Elimination? Contrast the space complexity with naïve enumeration.
Variable elimination is a form of dynamic programming where intermediate results are cached: we keep the result of each marginalization for reuse in the next marginalization down the line. Hence we expect the complexity to be lower than for full enumeration, because we avoid the nested exponential blowup, trading repeated on-the-fly computation for increased space. We consider each sum right to left. First consider the sum over Xn−1, replacing it with an intermediate factor that depends only on Xn−2:

p(X1 | Xn = T) ∝ p(X_1) \sum_{X_2} p(X_2 | X_1) \cdots \sum_{X_{n-1}} p(X_{n-1} | X_{n-2}) p(X_n = T | X_{n-1})    (17)
              = p(X_1) \sum_{X_2} p(X_2 | X_1) \cdots \sum_{X_{n-2}} p(X_{n-2} | X_{n-3}) f(X_{n-2})    (18)

Iterate the process, caching intermediate results, until we reach p(X1), e.g.

f(X_{n-3}) = \sum_{X_{n-2}} p(X_{n-2} | X_{n-3}) f(X_{n-2}),    (19)
f(X_{n-4}) = \sum_{X_{n-3}} p(X_{n-3} | X_{n-4}) f(X_{n-3}),    (20)

and so on. Note each f(X_{n−i}) is a 2-tuple, one value for each of the possibilities X_{n−i} = True/False; for k-ary variables f(X_{n−i}) would be a k-tuple. This takes n steps. The work done at each marginalization step is 2 × 2 operations (we multiply each possible value of the node being marginalized with each possible value of its parent). In the more general case where each variable may assume up to k possible values, each sum evaluation takes |X_{i−1}| × |X_i| ≤ k^2 operations, for an O(nk^2) runtime overall (see the sketch following this part). To proceed more formally, you can show this by induction.
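For concreteness, here is a brief sketch (not part of the original solution) of the O(nk^2) right-to-left message passing from (b) in numpy. The chain length n, the domain size k, the randomly generated CPTs, and the convention that index 0 encodes True are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2                                   # chain length and domain size (assumed)

# prior: p(X1), length-k vector.
# cpts[i]: k x k matrix with cpts[i][x_prev, x] = p(X_{i+2} = x | X_{i+1} = x_prev).
prior = rng.dirichlet(np.ones(k))
cpts = [rng.dirichlet(np.ones(k), size=k) for _ in range(n - 1)]

def chain_posterior(prior, cpts, evidence_value=0):
    """p(X1 | Xn = evidence_value) by right-to-left variable elimination.

    Each step is one k x k matrix-vector product, so the total cost is O(n k^2),
    and only a length-k message f is stored between steps.
    """
    f = cpts[-1][:, evidence_value]           # f(X_{n-1}) = p(Xn = T | X_{n-1})
    for cpt in reversed(cpts[:-1]):           # eliminate X_{n-1}, ..., X_2 in turn
        f = cpt @ f                           # f(X_{i-1}) = sum_{X_i} p(X_i|X_{i-1}) f(X_i)
    unnorm = prior * f
    return unnorm / unnorm.sum()

print(chain_posterior(prior, cpts))
```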
(c) (Bonus [∗]) What is the complexity of variable elimination on a general Bayes net structure? Express your answer in terms of:
• The number of variables in the full joint, n.
• The maximum number of values that a variable may assume, k = max_i |X_i|.
• The size of the largest intermediate factor generated, S.

Each iteration of variable elimination isolates a particular variable Xi and marginalizes it out by performing the summation over only the subset of factors involving Xi. This generates a new intermediate factor with the variable Xi 'eliminated'. Let X^{(i)} = {X_1, . . . , X_i, . . . , X_{s(i)}} be the variables appearing in the subset of factors involved when eliminating Xi. The general computation can be expressed as:

φ_{new}(X^{(i)} \setminus {X_i}) = \sum_{X_i} \prod_{j=1}^{J_i} φ_j(X_i, X^{(j)})    (21)

where J_i is the total number of factors involving Xi and X^{(j)} denotes all variables other than Xi appearing in factor j. We take the sum above over all possible configurations of the variables X^{(i)} \setminus {X_i} = {X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_{s(i)}}. For each configuration, there are J_i × |X_i| ≤ k J_i multiplication ops and |X_i| ≤ k addition ops. There are at most k^{s(i)−1} possible configurations to consider when eliminating variable Xi. Let S = max_i (s(i) − 1) be the number of variables in the largest intermediate factor φ generated. Then the complexity of VE is O(n k^{S+1}) (the algorithm has n elimination steps and the runtime is dominated by the largest factor in the exponent).

Now for the important part: the runtime of variable elimination grows exponentially with the size of the largest intermediate factor generated. So the overall complexity of the algorithm depends on the number of variables in the largest clique formed during elimination on the moralized graph; minimized over elimination orderings, this quantity (less one) is referred to as the treewidth. In general, finding the elimination ordering that minimizes the cardinality of the largest clique formed is NP-hard, but there exist a number of useful heuristics for finding good/efficient elimination orders.
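To make equation (21) concrete, here is a minimal sketch (not part of the original solution, and not the algorithm presented in lecture) of a single elimination step. The factor representation as (variable list, table) pairs and the helper name are assumptions for illustration, as are the numeric values in the usage example.

```python
import itertools

# Sketch of one elimination step (Eq. 21): a factor is a (variables, table) pair,
# where `table` maps a tuple of values (one per variable, in order) to a float.

def multiply_and_marginalize(factors, var, domain):
    """Multiply all factors mentioning `var`, then sum `var` out.

    The work is exponential in the number of distinct variables appearing in the
    combined factor, which is the source of the k^(S+1) term in the runtime bound.
    """
    involved = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_vars = sorted({v for vs, _ in involved for v in vs if v != var})

    new_table = {}
    for assignment in itertools.product(domain, repeat=len(new_vars)):
        ctx = dict(zip(new_vars, assignment))
        total = 0.0
        for x in domain:                       # sum over the eliminated variable
            ctx[var] = x
            prod = 1.0
            for vs, table in involved:         # product of all factors containing var
                prod *= table[tuple(ctx[v] for v in vs)]
            total += prod
        new_table[assignment] = total
    return rest + [(new_vars, new_table)]

# Usage example (hypothetical values): eliminate A from p(A|B) p(J=true|A).
fA = (["B", "A"], {(True, True): 0.9, (True, False): 0.1,
                   (False, True): 0.2, (False, False): 0.8})
fJ = (["A"], {(True,): 0.7, (False,): 0.05})
print(multiply_and_marginalize([fA, fJ], "A", [True, False]))
```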