COMP9418: Advanced Topics in Statistical Machine Learning
Bayesian Networks
The University of New South Wales
Introduction
§ This lecture introduces Bayesian networks as a modelling tool to specify joint probability distributions
§ The size of a joint distribution is exponential in the number of variables
§ This causes modelling and computational difficulties
§ The specification of a joint distribution may hide some relevant properties such as independencies
§ Bayesian networks are a graphical modelling tool for specifying probability distributions
§ They rely on the insight that independence is a significant aspect of beliefs, and
§ Independencies can be elicited using the language of graphs
Graphs and Independence
§ This figure is a directed acyclic graph (DAG)
§ Nodes represent variables
§ Let us assume (for now) that edges represent “direct causal influence”
§ For example, alarm triggering (𝐴) causes a call from a neighbour (𝐶)
§ Given this representation, we expect the belief dynamics to satisfy some properties
§ For instance, 𝐶 is influenced by evidence on 𝑅
§ A radio report would increase the belief in the alarm triggering. In turn, this would
increase the belief in a call from a neighbour
§ However, the belief in 𝐶 would not increase if we knew the alarm did not trigger
§ 𝐶 ⊥ 𝑅|¬𝐴
[Figure: DAG with edges 𝐸 → 𝑅, 𝐸 → 𝐴, 𝐵 → 𝐴, 𝐴 → 𝐶 over Earthquake? (𝐸), Burglary? (𝐵), Radio? (𝑅), Alarm? (𝐴), and Call? (𝐶)]
§ Given a variable 𝑉 in a DAG 𝐺
§ Parents(𝑉) are the parents of 𝑉 in DAG 𝐺, that is, the
set of variables 𝑁 with an edge from 𝑁 to 𝑉
§ Descendants(𝑉) are the descendants of 𝑉 in 𝐺, that is,
the set of variables 𝑁 with a directed path from 𝑉 to 𝑁
§ Non_Descendants(𝑉) are all variables in 𝐺 other than 𝑉,
Parents(𝑉), and Descendants(𝑉)
§ A DAG 𝐺 is a compact representation of the following independence statements
§ 𝑉 ⊥ Non_Descendants(𝑉) | Parents(𝑉)
§ Every variable is conditionally independent of its
nondescendants given its parents
§ These are called the Markovian assumptions of 𝐺, denoted Markov(𝐺)
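To make these sets concrete, here is a minimal Python sketch (an illustration, not course code; the dict-of-parents representation and node names are assumptions) that computes them for the DAG above:

```python
# A DAG stored as a mapping from each node to its set of parents.
dag = {"E": set(), "B": set(), "R": {"E"}, "A": {"E", "B"}, "C": {"A"}}

def parents(dag, v):
    return set(dag[v])

def children(dag, v):
    return {n for n, ps in dag.items() if v in ps}

def descendants(dag, v):
    # Nodes reachable from v along a directed path.
    found, stack = set(), [v]
    while stack:
        for c in children(dag, stack.pop()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def non_descendants(dag, v):
    return set(dag) - {v} - parents(dag, v) - descendants(dag, v)

# Markov(G) statement for A: A ⊥ {R} | {E, B}
print(parents(dag, "A"), descendants(dag, "A"), non_descendants(dag, "A"))
```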
Markovian Assumptions
§ If we view DAG 𝐺 as a causal structure
§ Parents(𝑉) are the direct causes of 𝑉
§ Descendants(𝑉) are the effects of 𝑉
§ Given the direct causes of a variable, our beliefs in that variable will no longer be influenced by any other variable except possibly by its effects
§ These are all the statements in this DAG
§ 𝐶 ⊥ 𝐵, 𝐸, 𝑅 | 𝐴
§ 𝑅 ⊥ 𝐴, 𝐵, 𝐶 | 𝐸
§ 𝐴 ⊥ 𝑅 | 𝐵, 𝐸
§ 𝐵 ⊥ 𝐸, 𝑅
§ 𝐸 ⊥ 𝐵
Markovian Assumptions
§ Suppose we want to construct a probability distribution 𝑃 that captures our state of belief
§ The first step is to construct the graph, ensuring the independencies encoded by 𝐺 match our beliefs
§ The DAG 𝐺 is a partial specification. It says that 𝑃 must satisfy Markov(𝐺)
§ The specification of 𝐺 restricts the choices for the distribution 𝑃
§ However, it does not uniquely define it
§ We need to augment 𝐺 with a set of conditional
probabilities
§ The conditional probabilities and 𝐺 are guaranteed to uniquely define the distribution 𝑃
§ The formal interpretation of a DAG is a set of conditional independences
§ It makes no reference to causality
§ However, we used causality to motivate this
interpretation
§ It is perfectly possible to have a DAG that does not match our causal perception
§ We will see that every independence in the first graph is also present in the second
§ We discuss next the graph parametrization (quantifying the dependencies between nodes and their parents)
§ This process is much easier to accomplish by an expert if the DAG corresponds to causal perceptions
[Figures: the causal DAG from earlier and a second DAG over the same variables, Earthquake? (𝐸), Burglary? (𝐵), Radio? (𝑅), Alarm? (𝐴), whose edges do not match causal perception]
Parametrisation
§ The conditional probabilities we need to specify are
§ For every variable 𝑋 in DAG 𝐺 and its parents 𝑼
§ Provide the probabilities 𝑃(𝑥|𝒖) for every value
𝑥 of 𝑋 and every instantiation 𝒖 of parents 𝑼
§ For example, for this graph, we need to specify
§ 𝑃(𝐴), 𝑃(𝐵|𝐴), 𝑃(𝐶|𝐴), 𝑃(𝐷|𝐵, 𝐶), and 𝑃(𝐸|𝐶)
§ Each table is known as a conditional probability
table (CPT)
§ Notice that ∑ₓ 𝑃(𝑥|𝒖) = 1 for each instantiation 𝒖 of the parents 𝑼
§ Therefore, we only need 11 probabilities to specify the CPTs of this graph
[Figure: DAG with edges 𝐴 → 𝐵, 𝐴 → 𝐶, 𝐵 → 𝐷, 𝐶 → 𝐷, 𝐶 → 𝐸 over Winter? (𝐴), Sprinkler? (𝐵), Rain? (𝐶), Wet Grass? (𝐷), and Slippery Road? (𝐸), with the following CPTs]

𝐴 | Θ_𝐴
𝑎 | .6
𝑎̄ | .4

𝐴 𝐵 | Θ_{𝐵|𝐴}
𝑎 𝑏 | .2
𝑎 𝑏̄ | .8
𝑎̄ 𝑏 | .75
𝑎̄ 𝑏̄ | .25

𝐴 𝐶 | Θ_{𝐶|𝐴}
𝑎 𝑐 | .8
𝑎 𝑐̄ | .2
𝑎̄ 𝑐 | .1
𝑎̄ 𝑐̄ | .9

𝐵 𝐶 𝐷 | Θ_{𝐷|𝐵,𝐶}
𝑏 𝑐 𝑑 | .95
𝑏 𝑐 𝑑̄ | .05
𝑏 𝑐̄ 𝑑 | .9
𝑏 𝑐̄ 𝑑̄ | .1
𝑏̄ 𝑐 𝑑 | .8
𝑏̄ 𝑐 𝑑̄ | .2
𝑏̄ 𝑐̄ 𝑑 | 0
𝑏̄ 𝑐̄ 𝑑̄ | 1

𝐶 𝐸 | Θ_{𝐸|𝐶}
𝑐 𝑒 | .7
𝑐 𝑒̄ | .3
𝑐̄ 𝑒 | 0
𝑐̄ 𝑒̄ | 1
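As a quick sanity check on these numbers, a small Python sketch (illustrative representation, not course code; variable and value names are assumptions) verifying that every CPT row is normalised and that 11 free parameters suffice:

```python
# CPTs as: variable -> {parent instantiation: distribution over the variable}.
cpts = {
    "A": {(): {"a": .6, "~a": .4}},
    "B": {("a",): {"b": .2, "~b": .8}, ("~a",): {"b": .75, "~b": .25}},
    "C": {("a",): {"c": .8, "~c": .2}, ("~a",): {"c": .1, "~c": .9}},
    "D": {("b", "c"): {"d": .95, "~d": .05}, ("b", "~c"): {"d": .9, "~d": .1},
          ("~b", "c"): {"d": .8, "~d": .2}, ("~b", "~c"): {"d": 0.0, "~d": 1.0}},
    "E": {("c",): {"e": .7, "~e": .3}, ("~c",): {"e": 0.0, "~e": 1.0}},
}

# Each row is a distribution: sum_x P(x|u) = 1 for every parent instantiation u.
for table in cpts.values():
    for dist in table.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-9

# One value per row follows from normalisation, so it need not be stored.
print(sum(len(dist) - 1 for table in cpts.values() for dist in table.values()))  # 11
```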
Bayesian Networks: Definition
§ A Bayesian network for variables 𝒁 is a pair (𝐺, Θ), where
§ 𝐺 is a directed acyclic graph over variables 𝒁, called the network structure
§ Θ is a set of CPTs, one for each variable in 𝒁, called the network parametrization
§ Θ_{𝑋|𝑼} denotes the CPT for variable 𝑋 and its parents 𝑼
§ 𝑋𝑼 denotes this set of variables, known as a network family
§ 𝜃_{𝑥|𝒖} is the value assigned to 𝑃(𝑥|𝒖), known as a network parameter
Bayesian Networks: More Definitions
§ A network instantiation is an assignment of values to all network variables
§ A network parameter 𝜃_{𝑥|𝒖} is compatible with a network instantiation 𝒛 when 𝑥𝒖 and 𝒛 agree on common variables
§ We write 𝜃_{𝑥|𝒖} ~ 𝒛
§ For instance, 𝜃_𝑎, 𝜃_{𝑏|𝑎}, 𝜃_{𝑐̄|𝑎}, 𝜃_{𝑑|𝑏,𝑐̄}, and 𝜃_{𝑒̄|𝑐̄} are the parameters compatible with the instantiation 𝑎, 𝑏, 𝑐̄, 𝑑, 𝑒̄
Bayesian Networks: More Definitions
§ Exactly one probability distribution satisfies the constraints imposed by a Bayesian network
§ The distribution is given by
𝑃(𝒛) ≝ ∏_{𝜃_{𝑥|𝒖} ~ 𝒛} 𝜃_{𝑥|𝒖}
§ This equation is known as the chain rule for Bayesian networks
§ For instance,
𝑃(𝑎, 𝑏, 𝑐̄, 𝑑, 𝑒̄) = 𝜃_𝑎 𝜃_{𝑏|𝑎} 𝜃_{𝑐̄|𝑎} 𝜃_{𝑑|𝑏,𝑐̄} 𝜃_{𝑒̄|𝑐̄} = .6 × .2 × .2 × .9 × 1 = .0216
Bayesian Networks: Complexity
§ The size of a CPT Θ_{𝑋|𝑼} is exponential in the number of parents 𝑼
§ If the maximal number of parents for any variable is 𝑘, then the size of any CPT is 𝑂(𝑑^(𝑘+1)), where 𝑑 is the maximal number of values per variable
§ With 𝑛 network variables, the total number of parameters is bounded by 𝑂(𝑛𝑑^(𝑘+1))
§ This number is reasonable if the number of parents is small
§ We will discuss techniques to represent CPTs when the number of parents is large
Properties of Independence
§ The distribution 𝑃 specified by a Bayesian network (𝐺, Θ) satisfies the independence assumptions in Markov(𝐺)
§ However, these are not the only independences satisfied by 𝑃
§ For example, 𝑅 ⊥ 𝐴 | 𝐸
§ This independence, among others, follows from the ones in Markov(𝐺)
§ To derive them, we use a set of properties known as the graphoid axioms
§ These properties include symmetry, decomposition, weak union and contraction
𝑋 ⊥ 𝑁𝑜𝑛_𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠 𝑋 | 𝑃𝑎𝑟𝑒𝑛𝑡𝑠(𝑋)
Properties of Independence: Symmetry
§ Symmetry is the simplest property of probabilistic independence
§ If learning 𝒚 does not influence our belief in 𝒙, then learning 𝒙 does not influence our belief in 𝒚.
𝑿 ⊥ 𝒀 | 𝒁 if and only if 𝒀 ⊥ 𝑿 | 𝒁
§ In the example graph
§ 𝐴 ⊥ 𝑅 | 𝐵, 𝐸 (Markovian property for 𝐴)
§ 𝑅 ⊥ 𝐴 | 𝐵, 𝐸 (using symmetry)
Properties of Independence: Decomposition
§ The second property is decomposition
§ If learning 𝒚𝒘 does not influence our belief in 𝒙, then learning 𝒚 alone, or learning 𝒘 alone, does not influence our belief in 𝒙.
𝑿 ⊥ 𝒀 ∪ 𝑾 | 𝒁 only if 𝑿 ⊥ 𝒀 | 𝒁 and 𝑿 ⊥ 𝑾 | 𝒁
§ In the example graph
§ 𝑅 ⊥ 𝐴, 𝐶, 𝐵 | 𝐸 (Markovian property for 𝑅)
§ 𝑅 ⊥ 𝐴 | 𝐸 (using decomposition)
§ 𝑅 ⊥ 𝐶 | 𝐸 (using decomposition)
§ 𝑅 ⊥ 𝐵 | 𝐸 (using decomposition)
§ Decomposition allows us to state the following
§ 𝑋 ⊥ 𝑾 | Parents(𝑋) for every 𝑾 ⊆ Non_Descendants(𝑋)
§ Notice that 𝑾 can be any subset of Non_Descendants(𝑋)
Properties of Independence: Decomposition
§ Decomposition allows us to prove the chain rule for Bayesian networks
§ By the chain rule of probability,
𝑃(𝑒, 𝑏, 𝑟, 𝑎, 𝑐) = 𝑃(𝑒)𝑃(𝑏|𝑒)𝑃(𝑟|𝑏, 𝑒)𝑃(𝑎|𝑒, 𝑏, 𝑟)𝑃(𝑐|𝑎, 𝑒, 𝑏, 𝑟)
§ The Markovian assumptions, together with decomposition, let us drop the irrelevant conditions: 𝑃(𝑏|𝑒) = 𝑃(𝑏) since 𝐵 ⊥ 𝐸; 𝑃(𝑟|𝑏, 𝑒) = 𝑃(𝑟|𝑒) since 𝑅 ⊥ 𝐵 | 𝐸; 𝑃(𝑎|𝑒, 𝑏, 𝑟) = 𝑃(𝑎|𝑒, 𝑏) since 𝐴 ⊥ 𝑅 | 𝐵, 𝐸; and 𝑃(𝑐|𝑎, 𝑒, 𝑏, 𝑟) = 𝑃(𝑐|𝑎) since 𝐶 ⊥ 𝐵, 𝐸, 𝑅 | 𝐴
§ Therefore
𝑃(𝑒, 𝑏, 𝑟, 𝑎, 𝑐) = 𝑃(𝑒)𝑃(𝑏)𝑃(𝑟|𝑒)𝑃(𝑎|𝑒, 𝑏)𝑃(𝑐|𝑎) = 𝜃_𝑒 𝜃_𝑏 𝜃_{𝑟|𝑒} 𝜃_{𝑎|𝑒,𝑏} 𝜃_{𝑐|𝑎}
§ This is exactly the chain rule for Bayesian networks: 𝑃(𝒛) ≝ ∏_{𝜃_{𝑥|𝒖} ~ 𝒛} 𝜃_{𝑥|𝒖}
Properties of Independence: Weak Union
§ The next property is weak union
§ If the information 𝒚𝒘 is not relevant to our belief in 𝒙, then the partial information 𝒚 will not make the rest of the information, 𝒘, relevant
𝑿 ⊥ 𝒀 ∪ 𝑾 | 𝒁 only if 𝑿 ⊥ 𝑾 | 𝒁 ∪ 𝒀
§ In the example graph
§ 𝐶 ⊥ 𝐵, 𝐸, 𝑅 | 𝐴 (Markovian property for 𝐶)
§ 𝐶 ⊥ 𝑅 | 𝐴, 𝐵, 𝐸 (using weak union)
§ Weak union allows us to state the following
§ 𝑋 ⊥ Non_Descendants(𝑋) ∖ 𝑾 | Parents(𝑋) ∪ 𝑾 for every 𝑾 ⊆ Non_Descendants(𝑋)
§ This can be viewed as a strengthening of the independencies declared by Markov(𝐺)
Properties of Independence: Contraction
§ The fourth property is contraction
§ If after learning the irrelevant information 𝒚 the information 𝒘 is found to be irrelevant to our belief in 𝒙, then the combined information 𝒚𝒘 must have been irrelevant from the beginning
𝑿 ⊥ 𝒀 | 𝒁 and 𝑿 ⊥ 𝑾 | 𝒁 ∪ 𝒀 only if 𝑿⊥𝒀∪𝑾|𝒁
[Figure: DAG over variables 𝑋₁, 𝑋₂, 𝑋₃, 𝑋₄]
Properties of Independence: Intersection
§ The final axiom is intersection
§ It holds only for the class of strictly positive distributions, that is, distributions that assign a non-zero probability to every instantiation of the variables
§ If information 𝒘 is irrelevant given 𝒚 and information 𝒚 is irrelevant given 𝒘, then the combined information 𝒚𝒘 is irrelevant to start with
§ Symmetry, decomposition, weak union, and contraction, together with the property of triviality (𝑿 ⊥ ∅ | 𝒁), form the graphoid axioms
§ Adding intersection yields what is known as the positive graphoid axioms
𝑿 ⊥ 𝒀 | 𝒁 ∪ 𝑾 and 𝑿 ⊥ 𝑾 | 𝒁 ∪ 𝒀 only if 𝑿⊥𝒀∪𝑾|𝒁
Graphical Test of Independence
§ Let 𝑃 be a distribution induced by a Bayesian network (𝐺, Θ)
§ 𝑃 satisfies independencies that go beyond what is in Markov(𝐺)
§ Graphoid axioms derive new independences
§ However, this derivation is not trivial
§ A graphical test known as d-separation can capture the inferential power of graphoid axioms
§ Let 𝑿, 𝒀, and 𝒁 be three disjoint sets of variables
§ 𝑿 and 𝒀 are d-separated by 𝒁 in DAG 𝐺, if every path between a
node in 𝑿 and a node in 𝒀 is blocked by 𝒁
§ If 𝑿 and 𝒀 are d-separated by 𝒁 then 𝑿 ⊥ 𝒀 | 𝒁 for every probability distribution induced by 𝐺
Graphical Test of Independence: Blocking
§ Consider a path between two nodes (note that a path may ignore edge directions)
§ We will view this path as a pipe and each variable 𝑊 on the path as a valve
§ A valve 𝑊 is either open or closed, depending on some conditions
§ If at least one of the valves on the path is closed, then the whole path is blocked
§ Otherwise the path is not blocked
§ There are three types of valves
§ They are determined by 𝑊's relationship to its neighbours on the path
[Figure: the three valve types: sequential, divergent, and convergent]
Graphical Test of Independence: Blocking
§ To gain more intuition, let us use a causal interpretation
§ A sequential valve 𝑁₁ → 𝑊 → 𝑁₂ declares 𝑊 as an intermediary between cause 𝑁₁ and its effect 𝑁₂
§ A divergent valve 𝑁₁ ← 𝑊 → 𝑁₂ declares 𝑊 as a common cause of two effects 𝑁₁ and 𝑁₂
§ A convergent valve 𝑁₁ → 𝑊 ← 𝑁₂ declares 𝑊 as a common effect of two causes 𝑁₁ and 𝑁₂
§ Now, we can better motivate the conditions for a valve to be closed given a set of variables 𝒁
§ A sequential valve is closed iff 𝑊 appears in 𝒁
§ A divergent valve is closed iff 𝑊 appears in 𝒁
§ A convergent valve is closed iff neither 𝑊 nor any of its descendants appears in 𝒁
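A small Python sketch of these valve-closing rules (illustrative helper names; the DAG is again a dict of parents, which is an assumed representation):

```python
def descendants(dag, v):
    # Nodes reachable from v along a directed path.
    children = {n: {m for m, ps in dag.items() if n in ps} for n in dag}
    found, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def valve_closed(dag, n1, w, n2, Z):
    """Is the valve at w, with path neighbours n1 and n2, closed given Z?"""
    if n1 in dag[w] and n2 in dag[w]:  # convergent: n1 -> w <- n2
        return w not in Z and not (({w} | descendants(dag, w)) & Z)
    return w in Z                      # sequential or divergent

# Example on the earthquake-burglary DAG: the convergent valve B -> A <- E.
dag = {"E": set(), "B": set(), "R": {"E"}, "A": {"E", "B"}, "C": {"A"}}
print(valve_closed(dag, "B", "A", "E", Z=set()))   # True: no evidence, closed
print(valve_closed(dag, "B", "A", "E", Z={"C"}))   # False: descendant C is in Z
```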
D-Separation: Definition
§ Formal definition of d-separation
§ Let 𝑿, 𝒀, and 𝒁 be disjoint sets of nodes in a DAG 𝐺. We will say that 𝑿 and 𝒀 are d-separated by 𝒁, written dsep_𝐺(𝑿, 𝒁, 𝒀), iff every path between a node in 𝑿 and a node in 𝒀 is blocked by 𝒁
§ A path is blocked by 𝒁 iff at least one valve on the path is closed given 𝒁
§ Notice that a path with no valves (𝑋 → 𝑌) is never blocked
D-Separation: Complexity
§ The definition of d-separation calls for considering all paths connecting a node in 𝑿 with a node in 𝒀
§ The number of paths can be exponential
§ But we can implement a test without enumerating these paths
§ Testing whether 𝑿 and 𝒀 are d-separated by 𝒁 in DAG 𝐺 is equivalent to testing whether 𝑿 and 𝒀 are disconnected in a new DAG 𝐺’, obtained as follows
§ We delete any leaf node 𝑊 from 𝐺 if 𝑊 does not belong to 𝑿 ∪
𝒀 ∪ 𝒁. This process is repeated until no more nodes can be deleted
§ We delete all edges outgoing from nodes in 𝒁
§ The connectivity test on DAG 𝐺’ ignores edge direction
§ This procedure requires time and space linear in the size of the DAG 𝐺
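Putting the pruning steps together, a hedged Python sketch of the whole test (illustrative code, not the course's implementation; the dict-of-parents DAG is an assumption):

```python
def d_separated(dag, X, Y, Z):
    """dag: node -> set of parents. True iff Z d-separates X from Y in dag."""
    dag = {n: set(ps) for n, ps in dag.items()}
    keep = set(X) | set(Y) | set(Z)
    # Step 1: repeatedly delete leaf nodes that are not in X, Y, or Z.
    while True:
        leaves = [n for n in dag
                  if n not in keep and all(n not in ps for ps in dag.values())]
        if not leaves:
            break
        for n in leaves:
            del dag[n]
    # Step 2: delete edges outgoing from Z; then ignore edge direction.
    adj = {n: set() for n in dag}
    for m, ps in dag.items():
        for p in ps:
            if p in dag and p not in Z:
                adj[p].add(m)
                adj[m].add(p)
    # Step 3: X and Y are d-separated iff no node in Y is reachable from X.
    seen, stack = set(X), list(X)
    while stack:
        for n in adj.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return not (seen & set(Y))

# In the earthquake-burglary DAG, R and C are d-separated by A but not by {}.
dag = {"E": set(), "B": set(), "R": {"E"}, "A": {"E", "B"}, "C": {"A"}}
print(d_separated(dag, {"R"}, {"C"}, {"A"}))  # True
print(d_separated(dag, {"R"}, {"C"}, set()))  # False
```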
[Figure: a larger example DAG over variables 𝐴, 𝑆, 𝑇, 𝐶, 𝐵, 𝑃, 𝐷, 𝑋]
§ Are 𝐴, 𝑆 d-separated from 𝐷, 𝑋 by 𝐵, 𝑃?
§ Are 𝑇, 𝐶 d-separated from 𝐵 by 𝑆, 𝑋?
D-Separation: Soundness and Completeness
§ The d-separation test is sound
§ If 𝑃 is a probability distribution induced by a Bayesian network
(𝐺, Θ), then dsep_𝐺(𝑿, 𝒁, 𝒀) only if 𝑿 ⊥ 𝒀 | 𝒁
§ We can safely use the d-separation test to derive independence statements about the probability distributions induced by Bayesian networks
§ The proof is constructive and shows that every independence claimed by d-separation can be derived using the graphoid axioms
§ The d-separation test is not complete
§ It is not capable of inferring every possible independence
statement that holds in the induced distribution 𝑃
§ The explanation is that some independences may be hidden in the network parameters
[Figure: chain DAG 𝐴 → 𝐵 → 𝐶 with the following CPTs]

𝐴 | 𝜃_𝐴
𝑎 | .6
𝑎̄ | .4

𝐴 𝐵 | 𝜃_{𝐵|𝐴}
𝑎 𝑏 | .8
𝑎 𝑏̄ | .2
𝑎̄ 𝑏 | .8
𝑎̄ 𝑏̄ | .2

𝐵 𝐶 | 𝜃_{𝐶|𝐵}
𝑏 𝑐 | .7
…
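To see the hidden independence concretely, a quick numeric check (a sketch using the values above) that 𝐴 and 𝐵 are independent even though the edge 𝐴 → 𝐵 is present:

```python
# theta_{b|a} = theta_{b|~a} = .8, so B carries no information about A.
theta_A = {"a": .6, "~a": .4}
theta_B = {("a", "b"): .8, ("a", "~b"): .2, ("~a", "b"): .8, ("~a", "~b"): .2}

P_b = sum(theta_A[a] * theta_B[(a, "b")] for a in theta_A)  # marginal P(b) = .8
for a in theta_A:
    joint = theta_A[a] * theta_B[(a, "b")]           # P(a, b) by the chain rule
    print(a, abs(joint - theta_A[a] * P_b) < 1e-12)  # True: P(a, b) = P(a) P(b)
```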
D-Separation: Soundness and Completeness
§ Therefore, if we choose the parametrization carefully, we establish independences that d-separation cannot detect
§ This is not surprising since d-separation has no access to the graph parametrization
§ We can conclude that, given a distribution 𝑃 induced by a Bayesian network 𝐺, Θ
§ If 𝑿 and 𝒀 are d-separated by 𝒁, then 𝑿 and 𝒀 are independent given 𝒁 for any parametrization Θ
§ If 𝑿 and 𝒀 are not d-separated by 𝒁, then whether 𝑿 and 𝒀 are dependent given 𝒁 depends on the specific parametrization Θ
D-Separation: Soundness and Completeness
§ We can always parametrize a DAG 𝐺 in such a way to ensure the completeness of d-separation
§ d-separation satisfies the following weak notion of completeness
§ For every DAG 𝐺, there is a parametrization Θ such that dsep_𝐺(𝑿, 𝒁, 𝒀) if and only if 𝑿 ⊥ 𝒀 | 𝒁
§ This weaker notion of completeness implies that one cannot improve on the d-separation test
§ There is no other graphical test that can derive more independencies from 𝐺
Independence Maps: I-MAPs
§ Independence maps describe the relationship between independence in a DAG and in a probability distribution
§ They are useful to understand the expressive power of DAGs as a language for independence statements
§ Let 𝐺 be a DAG and 𝑃 a probability distribution over the same variables
§ 𝐺 is an independence map (I-MAP) of 𝑃 iff
dsep_𝐺(𝑿, 𝒁, 𝒀) only if 𝑿 ⊥ 𝒀 | 𝒁
§ It means that every independence declared by d-separation
holds in 𝑃
§ An I-MAP is minimal if 𝐺 ceases to be an I-MAP when we delete any edge from 𝐺
§ If 𝑃 is induced by a Bayesian network (𝐺, Θ), then 𝐺 must be an
I-MAP of 𝑃
§ But it may not be minimal
Independence Maps: D-MAPs
§ 𝐺 is a dependency map (D-MAP) of 𝑃 iff
𝑿 ⊥ 𝒀 | 𝒁 only if dsep_𝐺(𝑿, 𝒁, 𝒀)
§ It means that a lack of d-separation in 𝐺 implies a dependence in 𝑃
§ If 𝑃 is induced by the Bayesian network (𝐺, Θ), then 𝐺 is not
necessarily a D-MAP of 𝑃
§ 𝐺 can be made a D-MAP of 𝑃 if we choose the parametrization Θ carefully
Independence Maps: Perfect MAPs
§ If a DAG 𝐺 is both an I-MAP and a D-MAP of 𝑃, then 𝐺 is a perfect map (P-MAP) of 𝑃
§ We want 𝐺 to be a P-MAP of the induced distribution to make all independences of 𝑃 accessible to d-separation
§ However, there are probability distributions for which there are no P-MAPs
§ Suppose we have four variables and a distribution 𝑃 that satisfies only these independencies
§ There is no DAG that is a P-MAP of 𝑃 in this case
𝑋₁ ⊥ 𝑋₂ | 𝑌₁, 𝑌₂
𝑋₂ ⊥ 𝑋₁ | 𝑌₁, 𝑌₂
𝑌₁ ⊥ 𝑌₂ | 𝑋₁, 𝑋₂
𝑌₂ ⊥ 𝑌₁ | 𝑋₁, 𝑋₂
Independence Maps
§ Given a distribution 𝑃, how can we construct a DAG that is guaranteed to be a minimal I-MAP of 𝑃?
§ Minimal I-MAPs tend to exhibit more independencies
§ Therefore, they require fewer parameters and lead to more compact networks
[Figure: the earthquake-burglary DAG from earlier]
§ Procedure to build a minimal I-MAP
§ Given an ordering 𝑋₁, …, 𝑋ₙ of the variables in 𝑃
§ Start with an empty DAG 𝐺 and consider the variables 𝑋ᵢ for 𝑖 = 1, …, 𝑛
§ For each 𝑋ᵢ, identify a minimal subset 𝑷 of the variables 𝑋₁, …, 𝑋ᵢ₋₁ such that 𝑋ᵢ ⊥ {𝑋₁, …, 𝑋ᵢ₋₁} ∖ 𝑷 | 𝑷
§ Make 𝑷 the parents of 𝑋ᵢ in 𝐺 (a sketch of this procedure follows below)
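A compact Python sketch of this procedure (the `independent` argument stands for whatever independence oracle is available for 𝑃, e.g. a statistical test or the distribution itself; all names are illustrative assumptions):

```python
from itertools import combinations

def minimal_imap(order, independent):
    """Build a minimal I-MAP of P for the given variable ordering.
    independent(x, rest, cond) must report whether x ⊥ rest | cond holds in P."""
    dag = {}
    for i, x in enumerate(order):
        preceding = order[:i]
        # Try candidate parent sets P in order of size, so the first set found
        # with X_i ⊥ {X_1, ..., X_{i-1}} \ P | P is minimal.
        for k in range(len(preceding) + 1):
            cand = next((set(c) for c in combinations(preceding, k)
                         if independent(x, set(preceding) - set(c), set(c))), None)
            if cand is not None:
                dag[x] = cand  # make P the parents of X_i
                break
    return dag
```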
Independence Maps
§ Suppose the earthquake-burglary DAG is a P-MAP of some distribution 𝑃
§ Applying the procedure to 𝑃 with different variable orderings yields different minimal I-MAPs
[Figures: minimal I-MAPs of 𝑃 obtained under different orderings of Earthquake? (𝐸), Burglary? (𝐵), Radio? (𝑅), Alarm? (𝐴); orderings that disagree with the original DAG can yield differently directed or additional edges]