COMP9418: Advanced Topics in Statistical Machine Learning
Markov Networks
Instructor: University of New South Wales
Introduction
§ This lecture discusses Markov networks
§ These are undirected graphical models
§ They are frequently used to model symmetrical dependencies, as in the case of pixels in an image
§ Like Bayesian networks, Markov networks are used to model variable independencies
§ However, these representations are not redundant
§ There exist sets of independencies that can be expressed in a Markov network but not in a Bayesian network and vice-versa
§ We will discuss the semantics of Markov networks
§ As well as some inference algorithms such as stochastic search and variable elimination
Introduction
§ Several processes, such as a sentence or an image, can be modelled as a series of states in a chain or grid
§ Each state can be influenced by the state of its neighbours
§ Such symmetry is modelled using undirected graphs called
Markov random fields (MRFs) or Markov networks (MN)
§ MNs were proposed to model ferromagnetic materials
§ In Physics, these models are known as Ising models
§ Each variable represents a dipole with two states + and –
§ The state of each dipole depends on an external field and its neighbours
[Figures: a chain X1 - X2 - X3 - X4 and a 3 × 4 grid of variables X1, …, X12]
Introduction
§ In an MN, a variable is independent of all other variables given its neighbours
§ For instance, in this figure, X1 ⊥ X3, X4 | X2
§ Therefore, P(X1 | X2, X3, X4) = P(X1 | X2)
§ A common query is to find the instantiation of maximum probability
§ MAP or MPE query
§ The probability of each instantiation depends on an external
influence (prior) and the internal influence (likelihood)
§ MNs can be thought as a series of rings in poles, where each ring is a variable, and the height of a ring corresponds to its state
[Figure: four poles with rings X1, X2, X3, X4]
Voting Example
§ Suppose that we are modelling voting preferences among four people: A, B, C, and D
§ Let’s say that 𝐴 − 𝐵, 𝐵 − 𝐶, 𝐶 − 𝐷, and 𝐷 − 𝐴 are friends
§ Friends can influence each other
§ These influences can be naturally represented by an undirected graph
§ In this example, 𝐴 does not interact directly with 𝐶. The same occurs with 𝐵 and 𝐷
§ 𝐴 ⊥ 𝐶|𝐵, 𝐷 and 𝐵 ⊥ 𝐷|𝐴, 𝐶
§ We saw there is no Bayesian network that can represent only these independence assumptions (Lecture 4, Slide 33)
Voting Example
§ Like Bayesian networks, Markov networks encode independence assumptions
§ Variables that are not independent must be in some factor
§ A factor is a generalization of a CPT: it does not need to store values in the range 0 to 1
§ In this example, we can factorise the joint distribution as
P(A,B,C,D) = (1/Z) φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A)
§ Z is a normalizing constant known as the partition function
§ Z = Σ_{A,B,C,D} P̃(A,B,C,D)
§ P̃(A,B,C,D) = φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A)
D A   φ4(D,A)        A B   φ1(A,B)
d a     100          a b     30
d ā     1            a b̄     5
d̄ a     1            ā b     1
d̄ ā     100          ā b̄     10

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100

[Figure: the square network A - B - C - D - A]
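As a sanity check, the factorisation and partition function above can be computed by brute-force enumeration. A minimal sketch, assuming the factor tables of the voting example encoded as dictionaries (0 for the unbarred state, 1 for the barred state):

```python
from itertools import product

# Pairwise factors of the voting example; 0 = unbarred, 1 = barred state
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # phi1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}    # phi3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi4(D,A)

def p_tilde(a, b, c, d):
    """Unnormalized measure: the product of the four pairwise factors."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function: sum the unnormalized measure over all 2^4 assignments
Z = sum(p_tilde(*x) for x in product([0, 1], repeat=4))

# MPE/MAP: the instantiation of maximum (unnormalized) probability
mpe = max(product([0, 1], repeat=4), key=lambda x: p_tilde(*x))

print(Z)                            # 7201840
print(mpe, p_tilde(*mpe))           # (0, 1, 1, 0) 5000000
print(round(p_tilde(*mpe) / Z, 2))  # 0.69
```

Enumeration is only feasible for tiny networks; it is used here to verify the factor tables.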
Voting Example
§ We can view φ1(A,B) as an interaction that pushes B's vote closer to A's
§ The term φ2(B,C) pushes B's vote closer to C's, but φ3(C,D) pushes C's vote away from D's (and vice-versa)
§ The most likely vote will require reconciling these conflicting influences
§ We simply indicate a level of coupling between dependent variables in the graph
§ This requires less prior knowledge than CPTs
§ It defines an energy landscape over the space of possible
assignments
§ We convert this energy to a probability via the normalization constant
D A   φ4(D,A)        A B   φ1(A,B)
d a     100          a b     30
d ā     1            a b̄     5
d̄ a     1            ā b     1
d̄ ā     100          ā b̄     10

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100

[Figure: the square network A - B - C - D - A]
Voting Example
MPE assignment

Assignment       Unnormalized   Normalized
(a, b̄, c̄, d)     5,000,000      0.69

D A   φ4(D,A)        A B   φ1(A,B)
d a     100          a b     30
d ā     1            a b̄     5
d̄ a     1            ā b     1
d̄ ā     100          ā b̄     10

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100

[Figure: the square network A - B - C - D - A]
Voting Example
§ Although expensive, the joint probability can be used to answer probabilistic queries
§ Prior marginal queries, such as P(A,B)

A B   P(A,B)
a b     .13
a b̄     .69
ā b     .14
ā b̄     .04

§ Probability of evidence, such as P(b̄)
§ Posterior marginal, such as P(b̄ | c) = 0.06
Voting Example: Bad News for Learning!
§ Suppose we had learned P(A,B) from data
§ By counting the occurrences of a and b
§ P(A,B) is not a direct replacement for φ1(A,B)

D A   φ4(D,A)        A B   φ1(A,B)          A B   P(A,B)
d a     100          a b     30             a b     .13
d ā     1            a b̄     5              a b̄     .69
d̄ a     1            ā b     1              ā b     .14
d̄ ā     100          ā b̄     10             ā b̄     .04

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100
Random Field
§ A random field X is a set of random variables
§ It is common for each variable Xi to be associated with a site
§ This idea comes from areas such as image processing, in which each variable is associated with a pixel or region
§ We use a set S to index a set of n sites
§ The sites can be spatially regular, as in the case of a 2D image
§ Or irregular, if they do not present spatial regularity
§ The sites in S are related to one another via a neighbourhood system
§ A site is not a neighbour of itself: i ∉ Ni
§ The neighbouring relationship is mutual: i ∈ Ni′ iff i′ ∈ Ni

X = {X1, …, Xn}
Ni is the set of sites neighbouring i
N = {Ni | ∀i ∈ S}

[Figure: a 3 × 4 grid of variables Xi,j]
Markov Networks
§ A random field 𝑿 is a Markov random field (or Markov network) on 𝑺 w.r.t. a neighbourhood system 𝑵 if and only if
§ P(X1 = x1, …, Xn = xn) > 0, ∀x ∈ X (positivity)
§ P(Xi | X_{S∖{i}}) = P(Xi | X_{Ni}) (Markovianity)
§ Graphically, Markov networks (MN) are undirected graphical models
§ 𝐺 = (𝑽, 𝑬), where 𝑽 consists of a set of random variables, and 𝑬 a set of undirected edges
§ A set of variables 𝑿 is independent of 𝒀 given 𝒁, if the variables in 𝒁 separate 𝑿 and 𝒀 in the graph
§ Therefore, if we remove the nodes in 𝒁 from the graph, there will be no paths between 𝑿 and 𝒀
Markov Networks: Gibbs Distribution
§ When the positivity condition is satisfied, the joint probability distribution is uniquely determined by the Gibbs distribution
§ This result is known as the Hammersley-Clifford theorem
§ Like in Bayesian networks, it allows us to factorise the full joint distribution into smaller factors
§ Therefore, we can efficiently answer probabilistic queries
§ Using the example, we have the following factorisation for maximal cliques
§ P(A,B,C,D) = (1/Z) φ1(A,B,D) φ2(B,C,D)
§ In practice, we frequently use smaller cliques, such as pairwise
§ P(A,B,C,D) = (1/Z) φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A) φ5(D,B)

P(X) = (1/Z) ∏_{c ∈ cliques(G)} φc(Xc)
Z = Σ_x ∏_{c ∈ cliques(G)} φc(xc)
Markov Networks: Positivity
§ This graph encodes the independencies
§ 𝐴 ⊥ 𝐶|𝐵, 𝐷 and 𝐷 ⊥ 𝐵|𝐴, 𝐶
§ Let us verify if this joint distribution has the same independence assumptions
[Tables: conditional distributions P(A|B,D) and P(C|B,D) with entries 0, .5, and 1; P(A,C|B,D) = .5; and a joint distribution P(A,B,C,D) that is uniform (1/8) over eight configurations]
Markov Networks: Positivity
§ This graph encodes the independencies
§ 𝐴 ⊥ 𝐶|𝐵, 𝐷 and 𝐷 ⊥ 𝐵|𝐴, 𝐶
§ Let us verify if this joint distribution has the same independence assumptions
[Tables: conditional distributions P(B|A,C) and P(D|A,C) with entries 0, .5, and 1; P(B,D|A,C) = .5; and the same joint distribution, uniform (1/8) over eight configurations]
Markov Networks: Positivity
§ This graph encodes the independencies
§ 𝐴 ⊥ 𝐶|𝐵, 𝐷 and 𝐷 ⊥ 𝐵|𝐴, 𝐶
§ Let us verify if this joint distribution has the same independence assumptions
§ P(ā, b, c, d̄) = φ1(ā,b) φ2(b,c) φ3(c,d̄) φ4(d̄,ā) = 0
§ P(ā, b, c, d) = φ1(ā,b) φ2(b,c) φ3(c,d) φ4(d,ā) = 1/8
§ P(ā, b̄, c̄, d̄) = φ1(ā,b̄) φ2(b̄,c̄) φ3(c̄,d̄) φ4(d̄,ā) = 1/8
§ P(a, b, c, d̄) = φ1(a,b) φ2(b,c) φ3(c,d̄) φ4(d̄,a) = 1/8

[Tables: factors φ1(A,B), φ2(B,C), φ3(C,D), φ4(D,A) on the square network, and the joint distribution, uniform (1/8) over eight configurations]
§ Different Gibbs distributions may induce a same undirected graph
§ 𝜙! 𝐴,𝐵,𝐷 𝜙"(𝐵,𝐶,𝐷)
§ 𝜙! 𝐴,𝐵,𝐷 𝜙" 𝐵,𝐷 𝜙# 𝐵,𝐶 𝜙$(𝐶,𝐷)
§ 𝜙! 𝐴,𝐵 𝜙"(𝐴,𝐷)𝜙# 𝐵,𝐷 𝜙$ 𝐵,𝐶 𝜙%(𝐶,𝐷)
§ Therefore, we cannot read the factorization from the graph
§ All these factorizations have the same independence assumptions
§ However, they do not have the same representational power
§ For example, for a fully connected graph over n variables with d states each, a maximal clique has O(d^n) parameters, but a pairwise factorization has only O(n²d²) parameters
Potentials
§ Clique factors can be:
§ Single-node factors: specify an affinity for a particular candidate
§ Pairwise factors: enforce affinities between friends, e.g. φAB(a,b) = 100 if a = b
§ Higher-order factors: important to specify relationships among sets of variables, e.g. φABC(a,b,c) = 100 if a ⊕ b ⊕ c
§ The normalization constant Z makes the factors scale invariant!
Factor Graphs
§ A factor graph is a graph containing two types of nodes
§ Random variables
§ Factors over the sets of variables
§ It allows us to derive the factorization without ambiguity
§ P(X1,X2,X3,X4,X5) = (1/Z) φ1(X1,X2,X3) φ2(X2,X3,X4) φ3(X3,X5)
§ P(X1,X2,X3,X4,X5) = (1/Z) φ1(X1,X2) φ2(X1,X3) φ3(X2,X3) φ4(X2,X4) φ5(X3,X4) φ6(X3,X5)
§ P(X1,X2,X3,X4,X5) = (1/Z) φ1(X1,X2) φ2(X1,X3) φ3(X2,X3,X4) φ4(X3,X5)

[Figures: the same undirected graph over X1, …, X5 shown with three different factor graphs]
Energy Functions
§ The joint probability in a MN is frequently expressed in terms of energy functions
§ 𝐸(𝑿) is the energy. Therefore, maximising 𝑃(𝑿) is equivalent to minimising 𝐸(𝑿)
§ The energy function can be written in terms of local functions 𝜓/ known as potentials
§ Historical: statistical physics
P(X) = (1/Z) exp(−E(X))

E(X) = Σ_{c ∈ cliques(G)} ψc(Xc)

P(X) = (1/Z) exp(−Σ_{c ∈ cliques(G)} ψc(Xc))

ψc(Xc) = −log φc(Xc)
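The conversion between factors and potentials can be sketched in a few lines. A minimal sketch, assuming the dictionary encoding of φ1(A,B) from the voting example (0 = unbarred, 1 = barred state):

```python
import math
from itertools import product

# phi1(A,B) from the voting example, as a dictionary over binary states
phi = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}

# Potential: psi_c(x) = -log phi_c(x); lower energy means higher probability
psi = {x: -math.log(v) for x, v in phi.items()}

# Recover the factor: phi_c(x) = exp(-psi_c(x))
for x in product([0, 1], repeat=2):
    assert abs(math.exp(-psi[x]) - phi[x]) < 1e-9

print(round(psi[0, 0], 2))   # -3.4, i.e. -ln 30
```

Because the map is a monotone bijection, maximising P(X) and minimising E(X) select the same assignment.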
Voting Example
[Table: assignments with their unnormalized and normalized probabilities]

D A   φ4(D,A)        A B   φ1(A,B)
d a     100          a b     30
d ā     1            a b̄     5
d̄ a     1            ā b     1
d̄ ā     100          ā b̄     10

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100

[Figure: the square network A - B - C - D - A]
Voting Example

[Table: assignments with their unnormalized and normalized probabilities]

D A   ψ4(D,A)        A B   ψ1(A,B)
d a    -4.61         a b    -3.40
d ā     0            a b̄    -1.61
d̄ a     0            ā b     0
d̄ ā    -4.61         ā b̄    -2.30

C D   ψ3(C,D)        B C   ψ2(B,C)
c d     0            b c    -4.61
c d̄    -4.61         b c̄     0
c̄ d    -4.61         b̄ c     0
c̄ d̄     0            b̄ c̄    -4.61

[Figure: the square network A - B - C - D - A]
Pairwise Markov Networks
§ Common subclass of Markov networks
§ All the factors are over single variables or pairs of variables
§ Node potentials: {ψ(Xi) : i = 1, …, n}
§ Edge potentials: {ψ(Xi, Xj) : (Xi, Xj) ∈ E}
§ Application: noise removal from binary images
§ Noisy image of pixel values Yi,j
§ Noise-free image of pixel values Xi,j
§ Markov net with
§ φ(Xi,j, Xi′,j′) potentials representing correlations between neighbouring pixels
§ φ(Xi,j, Yi,j) potentials describing correlations between the same pixel in the noise-free and noisy images

[Figure: a 3 × 4 grid of variables Xi,j]
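The denoising setup can be sketched with a greedy local search over a pairwise energy (this anticipates the ICM variant of the stochastic search discussed later). A sketch under assumed parameters: ±1 pixel values, a hypothetical 6 × 6 image, and coupling strengths eta (data term) and beta (smoothness term) chosen for illustration:

```python
import random

random.seed(1)
H, W = 6, 6
eta, beta = 2.1, 1.0   # assumed strengths: data term (eta) vs smoothness term (beta)

# X_true: a hypothetical noise-free binary image; Y: its noisy observation
X_true = [[1 if j < W // 2 else -1 for j in range(W)] for i in range(H)]
Y = [[x if random.random() > 0.1 else -x for x in row] for row in X_true]

def energy(X):
    """Pairwise-MN energy: -eta*x*y ties X to the observation Y,
    -beta*x*x' ties neighbouring pixels of X to each other."""
    e = 0.0
    for i in range(H):
        for j in range(W):
            e -= eta * X[i][j] * Y[i][j]
            if i + 1 < H:
                e -= beta * X[i][j] * X[i + 1][j]
            if j + 1 < W:
                e -= beta * X[i][j] * X[i][j + 1]
    return e

# Greedy search: flip any pixel that lowers the energy, until none does
X = [row[:] for row in Y]
changed = True
while changed:
    changed = False
    for i in range(H):
        for j in range(W):
            before = energy(X)
            X[i][j] = -X[i][j]
            if energy(X) < before:
                changed = True
            else:
                X[i][j] = -X[i][j]   # revert: the flip did not lower the energy

errors = sum(X[i][j] != X_true[i][j] for i in range(H) for j in range(W))
print(energy(Y), energy(X), errors)
```

With eta > 0 the restored image stays close to the observation, while beta > 0 smooths isolated flipped pixels back towards their neighbours.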
Example: Image Smoothing
§ Many applications of Markov networks involve finding the MAP or MPE assignment
§ This is known as the MAP-MRF approach
§ Given the Gibbs distribution, it is equivalent to minimize the energy function
§ The number of possible assignments is very large
§ It increases exponentially with the number of variables in the network
§ For instance, for a binary image of 100 × 100 pixels, there are 2^10,000 possible assignments
§ Finding the assignment of minimal energy is usually posed as a stochastic search
§ Start with a random value for each variable in the network
§ Improve this configuration via local operations
§ Until a configuration of (local) minimum energy is found
Stochastic Search Algorithm
Input: Markov network N with variables X, energy function E
Output: an assignment s for X with minimum (local) energy

s ← initial assignment for every variable Xi ∈ X
for i = 1 to I                              # I is the maximum number of iterations
    for each variable Xi ∈ X do
        si′ ← alternative value for variable Xi
        if E(s′) < E(s) or random(E(s′) − E(s)) < T then s ← s′
                                            # T is the threshold for accepting a change to a higher-energy state
    if E(sprev) − E(s) < ε then return s    # ε is a convergence threshold
    sprev ← s
return s
Example: Image Smoothing
§ This algorithm has three main variations
§ Iterative Conditional Modes (ICM): it always selects
the assignment of minimum energy
§ Metropolis: with a fixed probability, 𝑝, it selects an assignment with higher energy
§ Simulated annealing (SA): with a variable probability, P(T), it selects an assignment with higher energy. T is a parameter known as temperature. The probability of selecting a value with higher energy is determined by the expression P(T) = e^(−δE/T), where δE is the energy difference. The value of T is reduced with each iteration
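The three variants differ only in the acceptance rule; simulated annealing can be sketched on a small Ising grid (ICM and Metropolis are the limiting cases of T → 0 and fixed T, respectively). A sketch under assumed parameters: grid size, coupling J, and cooling rate are our own choices:

```python
import math
import random

random.seed(0)
N = 8          # an 8 x 8 grid of +/-1 dipoles (Ising model)
J = 1.0        # coupling strength: neighbouring dipoles prefer to agree

def energy(s):
    """E(s) = -J * sum of s_i * s_j over neighbouring pairs (lower when neighbours agree)."""
    e = 0.0
    for i in range(N):
        for j in range(N):
            if i + 1 < N:
                e -= J * s[i][j] * s[i + 1][j]
            if j + 1 < N:
                e -= J * s[i][j] * s[i][j + 1]
    return e

def delta_e(s, i, j):
    """Energy change caused by flipping the dipole at site (i, j)."""
    nb = sum(s[x][y] for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
             if 0 <= x < N and 0 <= y < N)
    return 2 * J * s[i][j] * nb

s = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(N)]
T = 2.0                                   # initial temperature
for sweep in range(200):
    for i in range(N):
        for j in range(N):
            dE = delta_e(s, i, j)
            # accept downhill moves always; uphill moves with probability e^(-dE/T)
            if dE <= 0 or random.random() < math.exp(-dE / T):
                s[i][j] = -s[i][j]
    T *= 0.97                             # cooling schedule: reduce T each sweep

print(energy(s))   # the global minimum is -2*J*N*(N-1) = -112 (all dipoles agree)
```

Accepting occasional uphill moves lets the search escape the local minima (e.g. domain walls) where pure ICM would get stuck.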
Local Independence
§ In a Markov network, the absence of edges implies independence
§ Given an undirected graph 𝐺 = (𝑽, 𝑬)
§ If the edge 𝑋 − 𝑌 ∉ 𝑬 then 𝑋 ⊥ 𝑌|𝑽 ∖ {𝑋, 𝑌}
§ These are known as pairwise Markov independencies of 𝐺
§ Another local property of independence is the Markov blanket
§ As in the case of Bayesian networks, the Markov blanket 𝑈 of a variable 𝑋 is the set of nodes such that 𝑋 is independent from the rest of the graph if 𝑈 is observed
§ In the undirected case, the Markov blanket turns out to be simply equal to a node's neighborhood
Global Independence: Separation
§ A global interpretation of independence uses the idea of separation
§ Let X, Y, and Z be disjoint sets of nodes in a graph G. We say that X and Y are separated by Z, written sepG(X, Z, Y), iff every path between a node in X and a node in Y is blocked by Z
§ A path is blocked by 𝒁 iff at least one valve on the path is closed given 𝒁
§ This is like Bayesian networks, but now there is no exception for convergent structures
Separation: Complexity
§ The definition of separation considers all paths connecting a node in 𝑿 with a node in 𝒀
§ In practice, this test is too inefficient
§ We can replace it by a cut-set test
§ Two sets 𝑿 and 𝒀 of variables are separated by a set 𝒁 iff
§ There is no path from any node X ∈ X to any node Y ∈ Y after removing all nodes in Z
§ 𝒁 is a cut-set between two parts of the graph
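The cut-set test can be sketched as a plain reachability check: delete the nodes in Z and see whether X can still reach Y. A minimal sketch; the edge-list encoding is our own:

```python
from collections import deque

def separated(edges, X, Y, Z):
    """Cut-set test: are X and Y separated by Z in an undirected graph?
    Remove the nodes in Z, then check reachability (BFS) from X to Y."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set(X) - set(Z)
    frontier = deque(seen)
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v not in Z and v not in seen:
                seen.add(v)
                frontier.append(v)
    return not (seen & set(Y))      # separated iff no node of Y was reached

# Voting-example graph: the square A - B - C - D - A
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(separated(edges, {"A"}, {"C"}, {"B", "D"}))   # True:  A is separated from C by {B, D}
print(separated(edges, {"A"}, {"C"}, {"B"}))        # False: the path A - D - C remains open
```

This runs in O(|V| + |E|) time, in contrast to enumerating all paths between X and Y.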
Separation: Soundness and Completeness
§ Like d-separation, separation test is sound
§ If 𝑃 is a probability distribution induced by a Markov
network then sepG(X, Z, Y) only if X ⊥ Y | Z
§ We can safely use separation test to derive independence statements about the probability distributions induced by Markov networks
§ Like d-separation, separation test is not complete
§ The lack of separation does not imply dependence
§ This is expected. Like d-separation, separation only looks at the graph structure
[Tables: node potentials φA and φB and an edge potential φA,B chosen so that A and B are independent despite being connected in the graph]
Markov vs Bayesian Networks
Markov Nets
§ Factors are easy to change (no normalization), but difficult to elicit
§ Can be applied to problems with cycles or no natural directionality
§ Difficult to read the factorization from the graph, but we can use factor graphs
§ 𝑍 requires summing over all entries (NP-hard)
Bayes Nets
§ Factors are easy to elicit from people
§ Must have no cycles and edges are directed
§ Graphs are easy to interpret particularly the causal ones
§ Naturally normalized
§ Easy to generate synthetic data from it (more about this later)
Markov vs Bayesian: Representation
§ Bayesian and Markov networks can be understood as languages to represent independencies
§ These languages can represent different sets of independencies § Therefore, these representations are not redundant
§ For example, there is no directed graph that is a perfect map for the top case
§ Conversely, there is no undirected graph that is a perfect map for the bottom case
§ In several circumstances, we need to find a Markov network that is an I-MAP for a Bayesian network
§ This is achievable through moralisation
§ We connect (marry) the unmarried parents of each child node
§ We lose the marginal independence of the parents
X1 ⊥ X2 | Y1, Y2    Y1 ⊥ Y2 | X1, X2

[Figures: top, an undirected 4-cycle over X1, Y1, X2, Y2; bottom, the v-structure X1 → Y ← X2 and its moralised graph with the added edge X1 - X2]
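Moralisation can be sketched directly from the definition. A minimal sketch, assuming the Bayesian network is given as a child → parents mapping:

```python
from itertools import combinations

def moralise(parents):
    """Build the moral graph of a Bayesian network given as {node: [parents]}:
    keep every parent-child edge (undirected) and marry all co-parents."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))      # parent-child edge, direction dropped
        for p1, p2 in combinations(ps, 2):
            edges.add(frozenset((p1, p2)))        # marry unmarried parents
    return edges

# v-structure X1 -> Y <- X2: moralisation adds the edge X1 - X2,
# which is exactly where the parents' marginal independence is lost
bn = {"X1": [], "X2": [], "Y": ["X1", "X2"]}
moral = moralise(bn)
print(frozenset(("X1", "X2")) in moral)   # True
```

The resulting undirected graph is an I-MAP of the Bayesian network: it may hide some independencies, but it never asserts a false one.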
Variable Elimination
§ Let us now consider if Variable Elimination (VE) works for Markov networks
§ The idea of VE is to anticipate the elimination of variables
§ Using the network example, suppose we want to compute P(A,B)
§ We start with the Gibbs distribution

P(A,B) = Σ_C Σ_D P(A,B,C,D)
       = Σ_C Σ_D (1/Z) φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A)
       ∝ Σ_C Σ_D φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A)
       = φ1(A,B) Σ_C φ2(B,C) Σ_D φ3(C,D) φ4(D,A)
D A   φ4(D,A)        A B   φ1(A,B)
d a     100          a b     30
d ā     1            a b̄     5
d̄ a     1            ā b     1
d̄ ā     100          ā b̄     10

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100

[Figure: the square network A - B - C - D - A]
Variable Elimination
C D A   φ3(C,D)·φ4(D,A)
c d a       100
c d ā       1
c d̄ a       100
c d̄ ā       10,000
c̄ d a       10,000
c̄ d ā       100
c̄ d̄ a       1
c̄ d̄ ā       100
D A   φ4(D,A)        A B   φ1(A,B)
d a     100          a b     30
d ā     1            a b̄     5
d̄ a     1            ā b     1
d̄ ā     100          ā b̄     10

C D   φ3(C,D)        B C   φ2(B,C)
c d     1            b c     100
c d̄     100          b c̄     1
c̄ d     100          b̄ c     1
c̄ d̄     1            b̄ c̄     100

[Figure: the square network A - B - C - D - A]
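The elimination order D, then C, can be sketched in code: multiply the factors that mention the variable, sum it out, and continue with the resulting smaller factor. A minimal sketch; the dictionary encoding (0 = unbarred, 1 = barred state) is our own:

```python
from itertools import product

# Pairwise factors of the voting example over binary states
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # phi1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}    # phi3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi4(D,A)

# Eliminate D: tau1(C,A) = sum_D phi3(C,D) * phi4(D,A)
tau1 = {(c, a): sum(phi3[c, d] * phi4[d, a] for d in (0, 1))
        for c, a in product((0, 1), repeat=2)}

# Eliminate C: tau2(B,A) = sum_C phi2(B,C) * tau1(C,A)
tau2 = {(b, a): sum(phi2[b, c] * tau1[c, a] for c in (0, 1))
        for b, a in product((0, 1), repeat=2)}

# Unnormalized P(A,B) = phi1(A,B) * tau2(B,A); normalise over all (a, b)
unnorm = {(a, b): phi1[a, b] * tau2[b, a] for a, b in product((0, 1), repeat=2)}
Z = sum(unnorm.values())
P_AB = {k: v / Z for k, v in unnorm.items()}

print(Z)                       # 7201840 - the same partition function as full enumeration
print(round(P_AB[0, 1], 2))    # 0.69 - the marginal P(a, b-bar) from the slides
```

Note that the sums never range over more than one eliminated variable at a time, which is exactly the saving VE provides over brute-force enumeration.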