COMP9418: Advanced Topics in Statistical Machine Learning
Variable Elimination
Instructor: University of New South Wales
Introduction
§ This lecture introduces one of the simplest methods for inference
§ It is based on the principle of variable elimination
§ We successively remove variables from the Bayesian network, maintaining its ability to answer queries of interest
§ In previous lectures we identified four types of queries
§ Probability of evidence, prior and posterior marginals, MPE and MAP
§ Variable elimination can be used to answer all these types of queries
§ But we will leave MPE and MAP to a future lecture on this topic
§ We will discuss the algorithm of variable elimination
§ Its complexity and how to make it more efficient
§ How to implement it in the tutorials
§ Its variants, such as bucket elimination
Process of Elimination
§ Given this Bayesian network
§ Winter? (A) is the parent of Sprinkler? (B) and Rain? (C); Wet Grass? (D) has parents B and C; Slippery Road? (E) has parent C
§ We are interested in computing the marginal over D and E

A  Θ_A
a  .6
ā  .4

A B  Θ_B|A
a b  .2
a b̄  .8
ā b  .75
ā b̄  .25

A C  Θ_C|A
a c  .8
a c̄  .2
ā c  .1
ā c̄  .9

B C D  Θ_D|B,C
b c d  .95
b c d̄  .05
b c̄ d  .9
b c̄ d̄  .1
b̄ c d  .8
b̄ c d̄  .2
b̄ c̄ d  0
b̄ c̄ d̄  1

C E  Θ_E|C
c e  .7
c ē  .3
c̄ e  0
c̄ ē  1

§ The variable elimination (VE) algorithm
§ Sums out variables A, B and C to construct the marginal over D and E

D E  P(D, E)
d e  .30443
d ē  .39507
d̄ e  .05957
d̄ ē  .24093
Process of Elimination
§ Consider the joint distribution over all variables. To sum out variable A
§ Merge all rows that agree on the values of B, C, D and E

A B C D E  P(.)
a b c d e  .06384
ā b c d e  .01995

§ Into a single row

B C D E  P(.)
b c d e  .08379 = .06384 + .01995

§ Resulting in a table with 16 rows that does not mention variable A
Process of Elimination
§ An important property of summing out variables
§ The new distribution is as good as the original one
§ As far as answering queries that do not mention 𝐴
§ That is, P′(α) = P(α) for any event α that does not involve A
§ Therefore, if we want to compute a marginal distribution, say, over 𝐷 and 𝐸
§ We just sum out variables 𝐴, 𝐵 and 𝐶 from the joint distribution
§ However, this procedure is exponential in the number of variables
§ The key insight of VE is that we can sum out variables without constructing the joint probability
§ This sometimes allows us to escape the exponential complexity
Factors
§ A factor is a function over a set of variables
§ It maps each instantiation of these variables to a non-negative number
§ In some cases the number represents a probability. A factor may represent a distribution (e.g., f₂ below) or a conditional distribution (e.g., f₁ below)
§ A factor over an empty set of variables is called trivial
§ It assigns a single number to the trivial instantiation ⊺
§ There are two main operations over factors
§ Summing out a variable
§ Multiplying two factors
§ These operations are building blocks of many inference algorithms

B C D  f₁
b c d  .95
b c d̄  .05
b c̄ d  .9
b c̄ d̄  .1
b̄ c d  .8
b̄ c d̄  .2
b̄ c̄ d  0
b̄ c̄ d̄  1

D E  f₂
d e  .448
d ē  .192
d̄ e  .112
d̄ ē  .248
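To make these definitions concrete, here is a minimal Python sketch of a factor representation, along the lines of what the tutorial implementation might use. The dictionary-based encoding and the names `domains` and `make_factor` are illustrative assumptions, not the lecture's code.

```python
from itertools import product

# An illustrative, dictionary-based factor encoding: a factor is a pair
# (variables, table), where `table` maps one tuple of values per row
# (in variable order) to a non-negative number.
domains = {'B': ('b', '~b'), 'C': ('c', '~c'),
           'D': ('d', '~d'), 'E': ('e', '~e')}

def make_factor(variables, values):
    """Build a factor from the list of values over the full instantiation
    grid of `variables` (rows enumerated by itertools.product)."""
    rows = list(product(*(domains[v] for v in variables)))
    assert len(rows) == len(values)
    return (tuple(variables), dict(zip(rows, values)))

# f2 over D and E from the slides; its rows sum to 1, so here the numbers
# form a distribution
f2 = make_factor(['D', 'E'], [.448, .192, .112, .248])

# A trivial factor has no variables and assigns a single number to the
# trivial instantiation
trivial = ((), {(): 1.0})
```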
Summing Out
§ Let f be a factor over variables 𝑿 and let X be a variable in 𝑿
§ The result of summing out variable X from the factor f is another factor over variables 𝒀 = 𝑿 ∖ {X}, which we denote by ∑_X f

(∑_X f)(𝒚) ≝ ∑_x f(x, 𝒚)

§ To illustrate this process, consider the factor f₁
§ Summing out variable D results in a new factor ∑_D f₁
§ If we sum out all variables, we get a trivial factor ∑_B ∑_C ∑_D f₁

B C  ∑_D f₁
b c  1
b c̄  1
b̄ c  1
b̄ c̄  1
Summing Out
§ The summing-out operation is commutative

∑_X ∑_Y f = ∑_Y ∑_X f

§ Therefore, we can sum out multiple variables without fixing an order
§ This justifies the notation ∑_𝑿 f, where 𝑿 is a set of variables
§ The algorithm below provides the pseudocode for summing out any number of variables
§ It is O(exp(w)) time and space
§ w is the number of factor variables
§ This operation is also known as marginalization
§ ∑_𝑿 f is also known as projecting factor f on the remaining variables 𝒀

Input: factor f(𝑿) and variables 𝒁 ⊆ 𝑿
Output: ∑_𝒁 f
𝒀 ← 𝑿 ∖ 𝒁
f′ ← a factor over variables 𝒀 where f′(𝒚) = 0 for all 𝒚
for each instantiation 𝒚 do
  for each instantiation 𝒛 do
    f′(𝒚) ← f′(𝒚) + f(𝒚𝒛)
return f′
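Under the illustrative encoding sketched earlier, this pseudocode translates almost line for line; a minimal sketch, assuming the `make_factor`/`domains` setup above:

```python
def sum_out(factor, Z):
    """Sum out the set of variables Z from `factor` (marginalization).
    Mirrors the pseudocode above: O(exp(w)) time and space, where w is
    the number of factor variables."""
    variables, table = factor
    keep = [i for i, v in enumerate(variables) if v not in set(Z)]
    Y = tuple(variables[i] for i in keep)
    result = {}
    for row, value in table.items():
        y = tuple(row[i] for i in keep)          # project the row onto Y
        result[y] = result.get(y, 0.0) + value   # f'(y) <- f'(y) + f(y, z)
    return (Y, result)

print(sum_out(f2, {'D'}))       # factor over E: e -> .56, ~e -> .44
print(sum_out(f2, {'D', 'E'}))  # trivial factor: ((), {(): 1.0})
```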
Multiplication
§ The second operation over factors is multiplication
§ If we multiply two factors, we construct a new factor over the union of their variables
§ Each instantiation of the new factor is compatible with exactly one instantiation of each original factor
§ The result of multiplying two factors f₁(𝑿) and f₂(𝒀) is another factor over variables 𝒁 = 𝑿 ∪ 𝒀, denoted by f₁f₂

(f₁f₂)(𝒛) ≝ f₁(𝒙) f₂(𝒚)

§ Where 𝒙 and 𝒚 are compatible with 𝒛, that is, 𝒙 ~ 𝒛 and 𝒚 ~ 𝒛
§ For example, multiplying f₁(B, C, D) and f₂(D, E) gives

B C D E  f₁(B,C,D) f₂(D,E)
b c d e  .4256 = (.95)(.448)
b c d ē  .1824 = (.95)(.192)
b c d̄ e  .0056 = (.05)(.112)
⋮
b̄ c̄ d̄ ē  .2480 = (1)(.248)
Multiplication
§ Factor multiplication is commutative and associative
§ We can multiply several factors without specifying the order of the multiplication
§ The algorithm below provides pseudocode for multiplying m factors
§ It is O(m exp(w)) time and space
§ w is the number of variables in the resulting factor

Input: factors f₁(𝑿₁), …, f_m(𝑿_m)
Output: ∏ᵢ₌₁ᵐ fᵢ
𝒁 ← ⋃ᵢ₌₁ᵐ 𝑿ᵢ
f ← a factor over variables 𝒁 where f(𝒛) = 1 for all 𝒛
for each instantiation 𝒛 do
  for i = 1 to m do
    𝒙ᵢ ← instantiation of variables 𝑿ᵢ consistent with 𝒛
    f(𝒛) ← f(𝒛) fᵢ(𝒙ᵢ)
return f
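A matching sketch of factor multiplication, again assuming the dictionary-based encoding and the `domains` table introduced earlier:

```python
from itertools import product  # as in the factor sketch above

def multiply(*factors):
    """Multiply m factors: the result ranges over the union Z of their
    variables, and each of its rows multiplies the compatible row of
    every input, as in the pseudocode above. O(m exp(w)) time and space
    for a result with w variables. Relies on the `domains` table."""
    variables = []
    for f_vars, _ in factors:
        for v in f_vars:
            if v not in variables:
                variables.append(v)              # Z <- union of the X_i
    index = {v: i for i, v in enumerate(variables)}
    result = {}
    for row in product(*(domains[v] for v in variables)):
        value = 1.0
        for f_vars, table in factors:
            x = tuple(row[index[v]] for v in f_vars)  # x_i ~ z
            value *= table[x]
        result[row] = value
    return (tuple(variables), result)

# e.g. multiplying f1(B, C, D) and f2(D, E) yields a factor over B, C, D, E
# whose row (b, c, d, e) equals .95 * .448 = .4256, as in the table above
```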
Variable Elimination (VE)
§ Suppose we want to compute the joint probability distribution for this network
§ We can use the chain rule for Bayesian networks

P(a, b, c, d, e) = θ_e|c θ_d|b,c θ_c|a θ_b|a θ_a

§ We can multiply the CPTs, viewing each CPT as a factor

P(A, B, C, D, E) = Θ_E|C Θ_D|B,C Θ_C|A Θ_B|A Θ_A

§ Suppose we want to compute the marginal for variables D and E
§ We need to sum out variables A, B and C

P(D, E) = ∑_{A,B,C} Θ_E|C Θ_D|B,C Θ_C|A Θ_B|A Θ_A

§ This is a combination of marginalization and multiplication
§ However, it still has the problem of complexity

Network: Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), Slippery Road? (E)
Variable Elimination (VE)
§ If f₁ and f₂ are factors and variable X appears only in f₂, then

∑_X f₁ f₂ = f₁ ∑_X f₂

§ If f₁, …, f_n are the CPTs of a Bayesian network and we want to sum out variable X
§ We may not need to multiply all these factors first
§ For instance, if variable X appears only in factor f_n

∑_X f₁ ⋯ f_n = f₁ ⋯ f_{n-1} ∑_X f_n

§ But if variable X appears in the two factors f_{n-1} and f_n

∑_X f₁ ⋯ f_n = f₁ ⋯ f_{n-2} ∑_X f_{n-1} f_n

§ In general, we need to multiply all factors fᵢ that include X and then sum out X from ∏ᵢ fᵢ
Variable Elimination (VE)
§ Consider this network, a chain A → B → C, and assume the goal is to compute P(C)

Θ_A: a .6, ā .4
Θ_B|A: a b .9, a b̄ .1, ā b .2, ā b̄ .8
Θ_C|B: b c .3, b c̄ .7, b̄ c .5, b̄ c̄ .5

§ We will first eliminate A and then B
§ There are two factors involving A: Θ_A and Θ_B|A

A B  Θ_A Θ_B|A
a b  .54
a b̄  .06
ā b  .08
ā b̄  .32

§ Summing out variable A, we get

B  ∑_A Θ_A Θ_B|A
b  .62 = .54 + .08
b̄  .38 = .06 + .32
Variable Elimination (VE)
§ Now we have two factors, Θ_C|B and ∑_A Θ_A Θ_B|A, and we want to eliminate variable B
§ Multiplying them gives

B C  Θ_C|B ∑_A Θ_A Θ_B|A
b c  .186
b c̄  .434
b̄ c  .190
b̄ c̄  .190

§ Summing out B gives the marginal P(C)

C  ∑_B Θ_C|B ∑_A Θ_A Θ_B|A
c  .376
c̄  .624
Computing Prior Marginals (VE_PR1)
§ This algorithm provides the pseudocode for computing the marginal over some variables 𝑸

Input: Bayesian network N, query variables 𝑸, variable ordering π
Output: prior marginal P(𝑸)
1: 𝑺 ← CPTs of network N
2: for i = 1 to length of order π do
3:   f ← ∏ⱼ fⱼ, where fⱼ belongs to 𝑺 and mentions variable π(i)
4:   f′ ← ∑_π(i) f
5:   replace all factors fⱼ in 𝑺 by factor f′
6: return ∏_{f∈𝑺} f

§ How much work does this algorithm do?
§ Note that f and f′ differ by only one variable, π(i)
§ For π = (A, B), the algorithm evaluates ∑_B Θ_C|B ∑_A Θ_A Θ_B|A
§ For π = (B, A), it evaluates ∑_A Θ_A ∑_B Θ_B|A Θ_C|B
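Assuming the `multiply` and `sum_out` sketches above, VE_PR1 itself is only a few lines. The function name `ve_pr1` and the list-based bookkeeping are illustrative choices:

```python
def ve_pr1(cpts, order):
    """VE_PR1: eliminate the variables in `order` from the list of CPT
    factors `cpts`, then multiply whatever remains; the result is the
    marginal over the variables that were not eliminated."""
    S = list(cpts)                                   # line 1
    for var in order:                                # line 2
        mentions = [f for f in S if var in f[0]]     # line 3
        f_new = sum_out(multiply(*mentions), {var})  # lines 3-4
        S = [f for f in S if var not in f[0]]        # line 5
        S.append(f_new)
    return multiply(*S)                              # line 6
```

Line 5 is realized by filtering out every factor that mentions π(i) and appending the new factor; after the loop, the factors left in S mention only the query variables.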
Computing Prior Marginals
§ Therefore, although any order will work
§ Some orders are better than others
§ Since they lead to constructing smaller intermediate factors
§ We need to find the best order
§ We will address this problem shortly
§ For now, let us try to formalize how to measure the quality of an elimination order
§ If the largest factor constructed has w variables, then the complexity of lines 3-5 is O(n exp(w))
§ This number w is known as the width of the elimination order
§ We want to find the elimination order with the smallest width
§ The algorithm complexity is O(n exp(w) + n exp(|𝑸|))
Elimination Order
§ Suppose we have two elimination orders π₁ and π₂
§ We want to choose the one with the smallest width
§ We can modify VE_PR1 to record the number of variables in the factor f of line 4
§ The width of an order is the maximum number of variables any constructed factor ever contained
§ For example, suppose we want to compute P(E) with the elimination order B, C, A, D

𝑺₁: Θ_A Θ_B|A Θ_C|A Θ_D|B,C Θ_E|C
  f₁ = ∑_B Θ_B|A Θ_D|B,C
𝑺₂: Θ_A Θ_C|A Θ_E|C f₁(A, C, D)
  f₂ = ∑_C Θ_C|A Θ_E|C f₁(A, C, D)
𝑺₃: Θ_A f₂(A, D, E)
  f₃ = ∑_A Θ_A f₂(A, D, E)
𝑺₄: f₃(D, E)
  f₄ = ∑_D f₃(D, E)
𝑺₅: f₄(E)
Interaction Graph
§ We can compute the width of an order by simply operating on an undirected graph
§ Let f₁, …, f_n be a set of factors. The interaction graph G of these factors is an undirected graph constructed as follows
§ The nodes of G are the variables that appear in the factors f₁, …, f_n
§ There is an edge between two variables in G iff those variables appear in the same factor
§ Another way to visualise the interaction graph is to realise that the variables 𝑿ᵢ of each factor fᵢ form a clique in G
§ For example, the factors Θ_A Θ_B|A Θ_C|A Θ_D|B,C Θ_E|C of the network over Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D) and Slippery Road? (E) have the interaction graph with edges A-B, A-C, B-C, B-D, C-D and C-E
Interaction Graph
Elimination order: B, C, A, D
§ Eliminating B, then C, then A produces interaction graphs over {A, C, D, E}, {A, D, E} and {D, E}

𝑺₁: Θ_A Θ_B|A Θ_C|A Θ_D|B,C Θ_E|C
𝑺₂: Θ_A Θ_C|A Θ_E|C f₁(A, C, D)
𝑺₃: Θ_A f₂(A, D, E)
𝑺₄: f₃(D, E)
Interaction Graph
§ There are two key observations about interaction graphs
§ If G is the interaction graph of factors 𝑺, then eliminating a variable π(i) from 𝑺 leads to constructing a factor over the neighbours of π(i) in G
§ Let 𝑺′ be the factors that result from eliminating variable π(i) from factors 𝑺. If G′ and G are the interaction graphs of 𝑺′ and 𝑺, respectively, then G′ can be obtained from G as follows
a) Add an edge to G between every pair of neighbours of variable π(i) that are not already connected
b) Delete variable π(i) from G

OrderWidth
Input: Bayesian network N, variable ordering π
Output: the width of π
G ← interaction graph of the CPTs in network N
w ← 0
for i = 1 to length of order π do
  w ← max(w, d), where d is the number of π(i)'s neighbours in G
  add an edge between every pair of non-adjacent neighbours of π(i) in G
  delete variable π(i) from G
return w

§ This algorithm provides pseudocode for computing the width of an elimination order
§ One application of OrderWidth is to measure the quality of an ordering before using it
§ However, when the number of orderings is large, we need to do better
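The same bookkeeping is easy to express with adjacency sets; a sketch of OrderWidth under that assumed representation, where `scopes` is the list of CPT variable sets:

```python
def order_width(scopes, order):
    """OrderWidth: `scopes` are the variable sets of the network's CPTs;
    the width of `order` is the largest number of neighbours a variable
    has in the interaction graph at the moment it is eliminated."""
    neighbours = {}
    for scope in scopes:                  # each scope forms a clique in G
        for v in scope:
            neighbours.setdefault(v, set()).update(scope - {v})
    width = 0
    for var in order:
        nbs = neighbours.pop(var)
        width = max(width, len(nbs))
        for u in nbs:
            neighbours[u].discard(var)    # delete var from G
            neighbours[u] |= nbs - {u}    # connect var's neighbours pairwise
    return width

scopes = [{'A'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C', 'D'}, {'C', 'E'}]
print(order_width(scopes, ['B', 'C', 'A', 'D', 'E']))  # width 3
```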
Interaction Graph
§ Computing the optimal ordering is an NP-hard problem
§ But there are several heuristic approaches that provide good results
§ One of the most popular is also the simplest: the min-degree heuristic
§ The min-degree heuristic eliminates the variable that leads to constructing the smallest factor possible
§ It means we should eliminate the variable with the smallest number of neighbours in the current graph
§ Min-degree is optimal when applied to a network that has some elimination order of width ≤ 2
MinDegreeOrder
Input: Bayesian network N with variables 𝑿
Output: an ordering π of variables 𝑿
G ← interaction graph of the CPTs in network N
for i = 1 to number of variables in 𝑿 do
  π(i) ← a variable in 𝑿 with the smallest number of neighbours in G
  add an edge between every pair of non-adjacent neighbours of π(i) in G
  delete variable π(i) from G and from 𝑿
return π

§ There is another popular heuristic that is usually more effective than MinDegreeOrder
§ It consists in eliminating the variable that leads to adding the smallest number of edges to G, called fill-in edges
§ This heuristic is called the min-fill heuristic
MinFillOrder
Input: Bayesian network N with variables 𝑿
Output: an ordering π of variables 𝑿
G ← interaction graph of the CPTs in network N
for i = 1 to number of variables in 𝑿 do
  π(i) ← a variable in 𝑿 whose elimination adds the smallest number of fill-in edges to G
  add an edge between every pair of non-adjacent neighbours of π(i) in G
  delete variable π(i) from G and from 𝑿
return π
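A sketch of MinDegreeOrder on the same adjacency-set representation as the OrderWidth sketch; the comment marks the one line where the min-fill heuristic would differ:

```python
def min_degree_order(scopes):
    """MinDegreeOrder: repeatedly eliminate the variable with the fewest
    neighbours in the current interaction graph."""
    neighbours = {}
    for scope in scopes:
        for v in scope:
            neighbours.setdefault(v, set()).update(scope - {v})
    order = []
    while neighbours:
        # min-degree criterion; the min-fill heuristic would instead pick
        # the variable whose elimination adds the fewest fill-in edges
        var = min(neighbours, key=lambda v: len(neighbours[v]))
        nbs = neighbours.pop(var)
        for u in nbs:
            neighbours[u].discard(var)
            neighbours[u] |= nbs - {u}
        order.append(var)
    return order

print(min_degree_order([{'A'}, {'A', 'B'}, {'A', 'C'},
                        {'B', 'C', 'D'}, {'C', 'E'}]))
```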
Computing Posterior Marginals
§ We now discuss an algorithm for computing the posterior marginal for a set of variables
§ For instance, for 𝑸 = {D, E} and 𝒆: A = true, B = false, we get the table below

D E  P(𝑸|𝒆)
d e  .448
d ē  .192
d̄ e  .112
d̄ ē  .248

§ More generally, given a network N, query 𝑸 and evidence 𝒆
§ We want to compute the posterior marginal P(𝑸|𝒆)
§ Prior marginals are a special case of posterior marginals in which 𝒆 is the trivial instantiation
Computing Posterior Marginals
§ It is more useful to construct a variation called the joint marginal, P(𝑸, 𝒆)
§ If we take 𝑸 = {D, E} and 𝒆: A = true, B = false, we get the joint marginal below

D E  P(𝑸, 𝒆)
d e  .21504
d ē  .09216
d̄ e  .05376
d̄ ē  .11904

§ If we add the probabilities in this factor, we get .48
§ This is the probability of evidence 𝒆, since ∑_𝒒 P(𝒒, 𝒆) = P(𝒆)
§ This means we can compute P(𝑸|𝒆) by simply normalizing P(𝑸, 𝒆)
§ We also get the probability of evidence 𝒆 for free
§ VE can be extended to compute joint marginals
§ We start by zeroing out those rows that are inconsistent with evidence 𝒆
Computing Posterior Marginals
§ The reduction of factor f(𝑿) given evidence 𝒆 is another factor over the same variables 𝑿, denoted by f^𝒆

f^𝒆(𝒙) ≝ f(𝒙) if 𝒙 ~ 𝒆, and 0 otherwise

§ For example, given the factor f below and evidence 𝒆: E = true, we obtain f^𝒆

D E  f     f^𝒆
d e  .448  .448
d ē  .192  0
d̄ e  .112  .112
d̄ ē  .248  0

§ The reduced factor can be stored more compactly by keeping only its nonzero rows: d e .448 and d̄ e .112
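A sketch of factor reduction under the dictionary encoding used earlier; evidence is assumed to be a dict from variable names to values:

```python
def reduce_factor(factor, evidence):
    """Reduction f^e: keep the value of every row compatible with the
    evidence (a dict such as {'E': 'e'}) and zero out the rest; the
    factor's variables are unchanged."""
    variables, table = factor
    idx = {v: i for i, v in enumerate(variables)}
    def compatible(row):
        return all(row[idx[v]] == val
                   for v, val in evidence.items() if v in idx)
    return (variables,
            {row: (value if compatible(row) else 0.0)
             for row, value in table.items()})

print(reduce_factor(f2, {'E': 'e'}))
# rows with E = ~e are zeroed: d e .448, d ~e 0, ~d e .112, ~d ~e 0
```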
Computing Posterior Marginals
§ For this network, if 𝑸 = {D, E} and 𝒆: A = true, B = false, the joint marginal P(𝑸, 𝒆) is

P(𝑸, 𝒆) = ∑_{A,B,C} (Θ_A Θ_B|A Θ_C|A Θ_D|B,C Θ_E|C)^𝒆

§ We can use the following result

(f₁ f₂)^𝒆 = f₁^𝒆 f₂^𝒆

§ Therefore

P(𝑸, 𝒆) = ∑_{A,B,C} Θ_A^𝒆 Θ_B|A^𝒆 Θ_C|A^𝒆 Θ_D|B,C^𝒆 Θ_E|C^𝒆
Computing Posterior Marginals
§ Consider this network, the chain A → B → C. Let 𝑸 = {C} and 𝒆: A = true
§ We want to compute P(𝑸, 𝒆) by eliminating A and then B
§ We first reduce the network CPTs given evidence 𝒆. The CPTs are

Θ_A: a .6, ā .4
Θ_B|A: a b .9, a b̄ .1, ā b .2, ā b̄ .8
Θ_C|B: b c .3, b c̄ .7, b̄ c .5, b̄ c̄ .5
Computing Posterior Marginals
§ Consider this network, the chain A → B → C. Let 𝑸 = {C} and 𝒆: A = true
§ We want to compute P(𝑸, 𝒆) by eliminating A and then B
§ We first reduce the network CPTs given evidence 𝒆
§ Then we need to evaluate

P(𝑸, 𝒆) = ∑_B ∑_A Θ_C|B^𝒆 Θ_A^𝒆 Θ_B|A^𝒆 = ∑_B Θ_C|B^𝒆 ∑_A Θ_A^𝒆 Θ_B|A^𝒆

§ The reduced CPTs (nonzero rows) are
Θ_A^𝒆: a .6
Θ_B|A^𝒆: a b .9, a b̄ .1
Θ_C|B^𝒆: b c .3, b c̄ .7, b̄ c .5, b̄ c̄ .5

§ Eliminating A:
Θ_A^𝒆 Θ_B|A^𝒆: a b .54, a b̄ .06
∑_A Θ_A^𝒆 Θ_B|A^𝒆: b .54, b̄ .06
Computing Posterior Marginals
§ Consider this network, the chain A → B → C. Let 𝑸 = {C} and 𝒆: A = true
§ We want to compute P(𝑸, 𝒆) by eliminating A and then B
§ We first reduce the network CPTs given evidence 𝒆
§ Then we need to evaluate

P(𝑸, 𝒆) = ∑_B ∑_A Θ_C|B^𝒆 Θ_A^𝒆 Θ_B|A^𝒆 = ∑_B Θ_C|B^𝒆 ∑_A Θ_A^𝒆 Θ_B|A^𝒆

§ Eliminating B:
Θ_C|B^𝒆 ∑_A Θ_A^𝒆 Θ_B|A^𝒆: b c .162, b c̄ .378, b̄ c .030, b̄ c̄ .030
∑_B Θ_C|B^𝒆 ∑_A Θ_A^𝒆 Θ_B|A^𝒆: c .192, c̄ .408
Computing Posterior Marginals
§ Consider this network, the chain A → B → C. Let 𝑸 = {C} and 𝒆: A = true
§ We want to compute P(𝑸, 𝒆) by eliminating A and then B
§ We first reduce the network CPTs given evidence 𝒆
§ Then we need to evaluate

P(𝑸, 𝒆) = ∑_B ∑_A Θ_C|B^𝒆 Θ_A^𝒆 Θ_B|A^𝒆 = ∑_B Θ_C|B^𝒆 ∑_A Θ_A^𝒆 Θ_B|A^𝒆

C  ∑_B Θ_C|B^𝒆 ∑_A Θ_A^𝒆 Θ_B|A^𝒆
c  .192
c̄  .408

§ To compute P(C|A = true)
§ We need to normalize this factor, which gives

C  P(C|A = true)
c  .32
c̄  .68
Computing Joint Marginals (VE_PR2)
Input: Bayesian network N, query variables 𝑸, variable ordering π, evidence 𝒆
Output: joint marginal P(𝑸, 𝒆)
1: 𝑺 ← {f^𝒆 : f is a CPT of network N}
2: for i = 1 to length of order π do
3:   f ← ∏ⱼ fⱼ, where fⱼ belongs to 𝑺 and mentions variable π(i)
4:   f′ ← ∑_π(i) f
5:   replace all factors fⱼ in 𝑺 by factor f′
6: return ∏_{f∈𝑺} f

§ It is not uncommon to run VE_PR2 with an empty 𝑸
§ The algorithm then returns a trivial factor whose single number is the probability of evidence 𝒆
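Putting the earlier sketches together, VE_PR2 is just VE_PR1 preceded by evidence reduction. Below is a hypothetical end-to-end run on the chain network A → B → C from the worked example, reusing the assumed helpers (`make_factor`, `multiply`, `sum_out`, `reduce_factor`):

```python
domains['A'] = ('a', '~a')   # extend the illustrative domains table

theta_A  = make_factor(['A'], [.6, .4])
theta_BA = make_factor(['A', 'B'], [.9, .1, .2, .8])
theta_CB = make_factor(['B', 'C'], [.3, .7, .5, .5])

def ve_pr2(cpts, order, evidence):
    """VE_PR2: reduce every CPT by the evidence (line 1), then eliminate
    the variables in `order` as in VE_PR1; returns the joint marginal
    P(Q, e), where Q are the variables not eliminated."""
    S = [reduce_factor(f, evidence) for f in cpts]
    for var in order:
        mentions = [f for f in S if var in f[0]]
        S = [f for f in S if var not in f[0]]
        S.append(sum_out(multiply(*mentions), {var}))
    return multiply(*S)

joint = ve_pr2([theta_A, theta_BA, theta_CB], ['A', 'B'], {'A': 'a'})
print(joint)                    # P(C, A=true): c -> .192, ~c -> .408
p_e = sum(joint[1].values())    # probability of evidence: .6
posterior = {row: v / p_e for row, v in joint[1].items()}
print(posterior)                # P(C | A=true): c -> .32, ~c -> .68
```

Running it with an empty query, i.e. an order covering every variable, would leave a trivial factor whose single entry is the probability of evidence, matching the remark above.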
Network Structure and Complexity
§ Suppose we have two Bayesian networks, each with 100 variables
§ We find the best elimination order for both
§ The first one has width 3, and therefore good performance
§ The second one has width 25, and poor performance regardless of the elimination order
§ Why is the second network more difficult given that both have the same number of variables?
§ The answer lies in the notion of treewidth
§ Intuitively, treewidth is a number that quantifies the extent to which a network resembles a tree structure
§ No elimination order can have a width smaller than the network treewidth
§ There is an elimination order whose width equals the network treewidth. Yet determining such an order is NP-hard
§ The treewidth can therefore be defined as the width of the best complete elimination order
§ A complete elimination order contains all network variables

Network with treewidth = 1
Treewidth: Intuition
§ The number of nodes has no effect on treewidth
§ For example, these two networks have treewidth = 1
§ But the second one has twice the number of nodes
Treewidth = 1
Treewidth = 1
Treewidth: Intuition
§ The number of parents per node has a direct effect on treewidth
§ If the number of parents can reach 𝑘, the treewidth is no less than 𝑘
Treewidth = 1
Treewidth = 2
Treewidth: Intuition
§ Loops tend to increase treewidth
§ For example, the second network was obtained from the first by introducing some loops
§ Notice the maximum number of parents per node did not increase
Treewidth = 2
Treewidth = 3
Treewidth: Intuition
§ The number of loops per se does not have a genuine effect on treewidth
§ It is the nature of the loops which does
§ Their interaction in particular
§ These two networks have the same treewidth, although the first has more loops
Treewidth = 3
Treewidth = 3
Classes of Networks with Known Treewidth
§ Polytree networks have at most one (undirected) path between any two nodes
§ Also known as singly connected networks
§ The treewidth of a polytree is k, where k is the maximum number of parents of any node
Treewidth = 2
Classes of Networks with Known Treewidth
§ Tree networks are polytrees where each node has at most one parent
§ Leading to a treewidth of at most 1
Treewidth = 1 Treewidth = 1
Classes of Networks with Known Treewidth