
COMP9418: Advanced Topics in Statistical Machine Learning
Learning Bayesian Networks Structure with Maximum Likelihood
University of New South Wales


Introduction
§ So far, we have assumed we know the structure of a Bayesian network
§ We were mostly concerned with estimating its parameters
§ The main approach for this estimation has been the search for ML estimates
§ That is, the ones that maximize the probability of observing a given dataset
§ Now, we assume the network structure is unknown
§ We want to learn the structure from a given dataset
§ We adopt the same approach we used for parameter estimation
§ That is, we search for network structures that maximize the probability of observing a given dataset
§ We start with this approach and show it leads to a general class of scoring functions for network structures
§ We focus on learning structures from complete data
§ Dealing with incomplete data is similar but much more demanding computationally

Introduction
§ Consider the Bayesian network 𝐺 with edges 𝐴 → 𝐵, 𝐴 → 𝐶 and 𝐵 → 𝐷, together with the ML estimates of its parameters below
§ Recall that in the previous lecture we defined the log-likelihood of a structure 𝐺 as

$$LL(G; \mathcal{D}) \triangleq \log L(\theta^{ml}; \mathcal{D})$$

§ For the dataset 𝒟 below, we can compute 𝐿𝐿(𝐺; 𝒟) as

$$LL(G; \mathcal{D}) = \log \prod_{i=1}^{N} P_{\theta^{ml}}(\mathbf{d}_i) = \sum_{i=1}^{N} \log P_{\theta^{ml}}(\mathbf{d}_i)$$

ML estimates for 𝐺:

  A   θ_A        A  B   θ_{B|A}      A  C   θ_{C|A}      B  D   θ_{D|B}
  a   4/5        a  b   3/4          a  c   1/4          b  d̄   1/4
  ā   1/5        a  b̄   1/4          a  c̄   3/4          b  d   3/4
                 ā  b   1            ā  c   1            b̄  d̄   1
                 ā  b̄   0            ā  c̄   0            b̄  d   0

Dataset 𝒟 and the probability of each case under 𝐺:

  Case    A  B  C  D     P_{θ^{ml}}(𝒅ᵢ)                     log₂ P_{θ^{ml}}(𝒅ᵢ)
  𝒅₁      a  b  c̄  d̄     4/5 · 3/4 · 3/4 · 1/4 = .1125      −3.15
  𝒅₂      a  b  c̄  d     4/5 · 3/4 · 3/4 · 3/4 = .3375      −1.57
  𝒅₃      a  b̄  c  d̄     4/5 · 1/4 · 1/4 · 1   = .05        −4.32
  𝒅₄      ā  b  c  d     1/5 · 1 · 1 · 3/4     = .15        −2.74
  𝒅₅      a  b  c̄  d     4/5 · 3/4 · 3/4 · 3/4 = .3375      −1.57

𝐿𝐿(𝐺; 𝒟) = −13.35
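To make this computation concrete, here is a minimal Python sketch (an illustration, not course-provided code; the dataset encoding and helper names are assumed choices) that estimates the ML parameters of 𝐺 from the five cases and sums the log₂-probabilities:

```python
import math
from collections import Counter

# The five cases of the running example (1 = true, 0 = false).
DATA = [
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]

# Structure G: each variable mapped to its (possibly empty) tuple of parents.
G = {"A": (), "B": ("A",), "C": ("A",), "D": ("B",)}

def ml_estimate(x, parents):
    """ML estimate of P(x | parents) as a dict {(x_value, parent_values): prob}."""
    joint = Counter((row[x], tuple(row[p] for p in parents)) for row in DATA)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in DATA)
    return {key: n / parent_counts[key[1]] for key, n in joint.items()}

def log_likelihood(dag):
    """LL(G; D) = sum over cases of log2 P_theta_ml(d_i)."""
    cpts = {x: ml_estimate(x, ps) for x, ps in dag.items()}
    ll = 0.0
    for row in DATA:
        # the probability of a case is the product of one CPT entry per family
        for x, ps in dag.items():
            ll += math.log2(cpts[x][(row[x], tuple(row[p] for p in ps))])
    return ll

print(round(log_likelihood(G), 2))  # about -13.34; rounding each case's log as on the slide gives -13.35
```

Swapping in the parent sets of the next structure 𝐺∗ (where 𝐷's parent is 𝐴) reproduces 𝐿𝐿(𝐺∗; 𝒟) ≈ −14.1, matching the comparison below.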

Introduction
§ Now, consider the alternative network structure 𝐺∗ in which the parent of 𝐷 is 𝐴 instead of 𝐵 (edges 𝐴 → 𝐵, 𝐴 → 𝐶 and 𝐴 → 𝐷), together with its ML estimates
§ Let us compute the log-likelihood of this network for the same dataset
§ The CPTs θ_A, θ_{B|A} and θ_{C|A} are unchanged; the new family of 𝐷 has the ML estimates

  A  D   θ_{D|A}
  a  d̄   1/2
  a  d   1/2
  ā  d̄   0
  ā  d   1

  Case    A  B  C  D     P_{θ^{ml}}(𝒅ᵢ)                    log₂ P_{θ^{ml}}(𝒅ᵢ)
  𝒅₁      a  b  c̄  d̄     4/5 · 3/4 · 3/4 · 1/2 = .225      −2.15
  𝒅₂      a  b  c̄  d     4/5 · 3/4 · 3/4 · 1/2 = .225      −2.15
  𝒅₃      a  b̄  c  d̄     4/5 · 1/4 · 1/4 · 1/2 = .025      −5.32
  𝒅₄      ā  b  c  d     1/5 · 1 · 1 · 1       = .2        −2.32
  𝒅₅      a  b  c̄  d     4/5 · 3/4 · 3/4 · 1/2 = .225      −2.15

𝐿𝐿(𝐺∗; 𝒟) = −14.09
§ As 𝐿𝐿(𝐺∗; 𝒟) < 𝐿𝐿(𝐺; 𝒟), we prefer 𝐺 to 𝐺∗
§ Our goal is to search for a maximum likelihood structure

Learning Tree Structures

§ We start by exploring an algorithm based on a scoring measure for tree structures
§ Therefore, each node will have at most one parent
§ The scoring function is expressed in terms of mutual information
§ The mutual information between variables 𝑋 and 𝑈 is a measure of the dependence between these variables in a distribution
§ Our scoring measure is based on mutual information in the empirical distribution:

$$MI_{\mathcal{D}}(X, U) \triangleq \sum_{x,u} P_{\mathcal{D}}(x, u) \log \frac{P_{\mathcal{D}}(x, u)}{P_{\mathcal{D}}(x)\,P_{\mathcal{D}}(u)}$$

§ Given a tree structure 𝐺 with edges 𝑈 → 𝑋, the scoring function is given by

$$tScore(G; \mathcal{D}) \triangleq \sum_{U \to X} MI_{\mathcal{D}}(X, U)$$

§ The score is the sum of the mutual information between every variable and its single parent in the tree structure

Learning Tree Structures

§ We can show that the trees having maximal likelihood are exactly those that maximize 𝑡𝑆𝑐𝑜𝑟𝑒
§ That is, if 𝐺 is a tree structure and 𝒟 is a complete dataset, then

$$\arg\max_G\, tScore(G; \mathcal{D}) = \arg\max_G\, LL(G; \mathcal{D})$$

§ Therefore, we can find a maximum likelihood tree using an algorithm for computing maximum spanning trees
§ This is known as the Chow-Liu algorithm, which we discuss shortly
§ But before that, let us understand the relationship between 𝑡𝑆𝑐𝑜𝑟𝑒 and the maximum log-likelihood

𝑡𝑆𝑐𝑜𝑟𝑒 and Maximum Log-likelihood

§ Let us recall the definitions of entropy, conditional entropy and mutual information:

$$H_{\mathcal{D}}(X) \triangleq -\sum_{x} P_{\mathcal{D}}(x) \log P_{\mathcal{D}}(x)$$

$$H_{\mathcal{D}}(X \mid \mathbf{U}) \triangleq -\sum_{x,\mathbf{u}} P_{\mathcal{D}}(x, \mathbf{u}) \log P_{\mathcal{D}}(x \mid \mathbf{u})$$

$$MI_{\mathcal{D}}(X, \mathbf{U}) \triangleq \sum_{x,\mathbf{u}} P_{\mathcal{D}}(x, \mathbf{u}) \log \frac{P_{\mathcal{D}}(x, \mathbf{u})}{P_{\mathcal{D}}(x)\,P_{\mathcal{D}}(\mathbf{u})}$$

§ Also remember from the previous lecture that the log-likelihood decomposes into components, one conditional entropy per family:

$$LL(G; \mathcal{D}) = -N \sum_{X\mathbf{U}} H_{\mathcal{D}}(X \mid \mathbf{U})$$

§ Expanding the mutual information and substituting the definitions of entropy and conditional entropy leads to

$$MI_{\mathcal{D}}(X, \mathbf{U}) = H_{\mathcal{D}}(X) - H_{\mathcal{D}}(X \mid \mathbf{U})$$

𝑡𝑆𝑐𝑜𝑟𝑒 and Log ML

§ Therefore, for a tree structure 𝐺 with edges 𝑈 → 𝑋,

$$LL(G; \mathcal{D}) = -N \sum_{XU} H_{\mathcal{D}}(X \mid U) = -N \sum_{XU} \big( H_{\mathcal{D}}(X) - MI_{\mathcal{D}}(X, U) \big) = -N \sum_{XU} H_{\mathcal{D}}(X) + N \sum_{XU} MI_{\mathcal{D}}(X, U)$$

§ Note that neither 𝑁 nor the term $-N \sum_{XU} H_{\mathcal{D}}(X)$ depends on the tree structure 𝐺. Hence,

$$\arg\max_G\, LL(G; \mathcal{D}) = \arg\max_G \sum_{XU} MI_{\mathcal{D}}(X, U) = \arg\max_G\, tScore(G; \mathcal{D})$$

Learning Tree Structures

§ Let us return to the algorithm for learning ML tree structures using 𝑡𝑆𝑐𝑜𝑟𝑒
§ We illustrate this algorithm for the variables 𝐴, 𝐵, 𝐶 and 𝐷 and the dataset above
§ The first step is constructing a complete, undirected graph over all variables
§ The cost of each edge is the mutual information between the two nodes connected by the edge
§ Let us detail the computation for the edge 𝐴 − 𝐵

Learning Tree Structures

$$MI_{\mathcal{D}}(A, B) = P_{\mathcal{D}}(a,b)\log\frac{P_{\mathcal{D}}(a,b)}{P_{\mathcal{D}}(a)P_{\mathcal{D}}(b)} + P_{\mathcal{D}}(a,\bar b)\log\frac{P_{\mathcal{D}}(a,\bar b)}{P_{\mathcal{D}}(a)P_{\mathcal{D}}(\bar b)} + P_{\mathcal{D}}(\bar a,b)\log\frac{P_{\mathcal{D}}(\bar a,b)}{P_{\mathcal{D}}(\bar a)P_{\mathcal{D}}(b)} + P_{\mathcal{D}}(\bar a,\bar b)\log\frac{P_{\mathcal{D}}(\bar a,\bar b)}{P_{\mathcal{D}}(\bar a)P_{\mathcal{D}}(\bar b)}$$

§ Writing $R(A, B) = P_{\mathcal{D}}(A, B) / \big(P_{\mathcal{D}}(A)\,P_{\mathcal{D}}(B)\big)$, the four terms are

  A  B   P(A,B)   P(A)   P(B)   R(A,B)   log R(A,B)   P(A,B)·log R(A,B)
  a  b   3/5      4/5    4/5    .94      −.09         −.054
  a  b̄   1/5      4/5    1/5    1.25     .32          .064
  ā  b   1/5      1/5    4/5    1.25     .32          .064
  ā  b̄   0        1/5    1/5    0        −∞           0   (taken as 0 since P(ā, b̄) = 0)

$$MI_{\mathcal{D}}(A, B) = .074$$
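As a sanity check on the table above, here is a small Python sketch (illustrative only, with an assumed dataset encoding) that computes the empirical mutual information between any two variables of the example dataset:

```python
import math
from collections import Counter

# The five cases of the running example (1 = true, 0 = false).
DATA = [
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
N = len(DATA)

def mutual_information(x, u):
    """MI_D(X, U) = sum over (x, u) of P(x, u) * log2[ P(x, u) / (P(x) P(u)) ]."""
    p_xu = Counter((row[x], row[u]) for row in DATA)
    p_x = Counter(row[x] for row in DATA)
    p_u = Counter(row[u] for row in DATA)
    mi = 0.0
    for (xv, uv), n in p_xu.items():   # terms with P(x, u) = 0 contribute 0
        ratio = (n / N) / ((p_x[xv] / N) * (p_u[uv] / N))
        mi += (n / N) * math.log2(ratio)
    return mi

print(round(mutual_information("A", "B"), 3))  # about 0.073; the table's rounded terms sum to .074
```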
Learning Tree Structures

§ The first step is constructing a complete, undirected graph over all variables
§ The cost of each edge is the mutual information between the two nodes connected by the edge
§ We then compute the spanning tree with maximal cost
§ The cost of a tree is just the sum of the costs associated with its edges
§ An algorithm for maximum spanning trees is a trivial adaptation of algorithms for minimum spanning trees, such as Prim's and Kruskal's
§ This method generates an undirected spanning tree
§ It coincides with several directed trees
§ We can choose any of these trees by selecting a node as root and directing the edges away from the root
§ The resulting trees are guaranteed to have maximal likelihood among all tree structures

Learning Tree Structures: Prim

(Figures: steps of Prim's algorithm on the complete mutual-information graph over 𝐴, 𝐵, 𝐶 and 𝐷; the selected edges 𝐴 − 𝐶, 𝐵 − 𝐶 and 𝐵 − 𝐷 each have cost .32, and the resulting directed trees have log-likelihood −12.1.)

Learning Tree Structures

§ Given a tree, we can compute its log-likelihood by computing the probability of each case in the dataset, as we did in the introduction example
§ Or we can use the following equation, which shows that the log-likelihood is a sum of terms, one for each family in the network:

$$LL(G; \mathcal{D}) = -N \sum_{X\mathbf{U}} H_{\mathcal{D}}(X \mid \mathbf{U})$$

§ For this network (rooted at 𝐵, with edges 𝐵 → 𝐶, 𝐶 → 𝐴 and 𝐵 → 𝐷), we have

$$LL(G; \mathcal{D}) = -N\big(H_{\mathcal{D}}(A \mid C) + H_{\mathcal{D}}(B) + H_{\mathcal{D}}(C \mid B) + H_{\mathcal{D}}(D \mid B)\big) = -5(.400 + .722 + .649 + .649) = -12.1$$

§ Notice the terms correspond to the families of the tree structure: 𝐴𝐶, 𝐵, 𝐶𝐵 and 𝐷𝐵

Learning Tree Structures: Example

§ Tree learned from ICU-Alarm network data
§ Not every edge in the tree is in the original network
§ Inferred edges are undirected

(Figure: the learned tree over the ICU-Alarm variables, with correct and spurious edges marked.)
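Before moving on to general DAGs, here is a compact sketch of the whole Chow-Liu construction on the running five-case dataset (an illustration; the dataset encoding, root choice and helper names are assumptions, not course code). It weights each pair of variables by empirical mutual information, grows a maximum spanning tree with a Prim-style loop, and directs the edges away from a chosen root:

```python
import math
from collections import Counter
from itertools import combinations

DATA = [  # the running five-case dataset (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
VARS = ["A", "B", "C", "D"]
N = len(DATA)

def mutual_information(x, u):
    """Empirical mutual information MI_D(X, U) in bits."""
    p_xu = Counter((row[x], row[u]) for row in DATA)
    p_x = Counter(row[x] for row in DATA)
    p_u = Counter(row[u] for row in DATA)
    return sum((n / N) * math.log2((n / N) / ((p_x[a] / N) * (p_u[b] / N)))
               for (a, b), n in p_xu.items())

def chow_liu(root="B"):
    # Edge weights of the complete graph: empirical mutual information.
    weight = {frozenset(e): mutual_information(*e) for e in combinations(VARS, 2)}
    # Prim-style maximum spanning tree, starting from the chosen root.
    in_tree, tree_edges = {root}, []
    while len(in_tree) < len(VARS):
        u, v = max(((a, b) for a in in_tree for b in VARS if b not in in_tree),
                   key=lambda e: weight[frozenset(e)])
        tree_edges.append((u, v))           # edge directed away from the root
        in_tree.add(v)
    # Parent map of the resulting directed tree.
    parents = {x: () for x in VARS}
    for u, v in tree_edges:
        parents[v] = (u,)
    return parents

print(chow_liu())  # -> {'A': ('C',), 'B': (), 'C': ('B',), 'D': ('B',)}
```

The resulting parent map corresponds to the tree with families 𝐴𝐶, 𝐵, 𝐶𝐵 and 𝐷𝐵, whose log-likelihood of −12.1 was computed above.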
Learning DAG Structures

§ Suppose now our goal is to find a maximum likelihood structure without restricting ourselves to tree structures
§ For example, consider the tree structure obtained above plus the edge 𝐷 → 𝐴
§ The log-likelihood of this DAG is

$$LL(G; \mathcal{D}) = -N\big(H_{\mathcal{D}}(A \mid C, D) + H_{\mathcal{D}}(B) + H_{\mathcal{D}}(C \mid B) + H_{\mathcal{D}}(D \mid B)\big) = -5(0 + .722 + .649 + .649) = -10.1$$

§ which is larger than the log-likelihood of the tree

Learning DAG Structures

§ Notice the only difference between the two log-likelihoods is the entropy term for variable 𝐴
§ This is the only variable with different families in the two structures
§ The family of 𝐴 is 𝐴𝐶 in the tree and 𝐴𝐶𝐷 in the DAG, and

$$H_{\mathcal{D}}(A \mid C, D) < H_{\mathcal{D}}(A \mid C), \qquad\text{so}\qquad -H_{\mathcal{D}}(A \mid C, D) > -H_{\mathcal{D}}(A \mid C)$$

§ This is why the DAG has a larger log-likelihood than the tree
§ However, this result is not completely accidental
§ We can show that if 𝑼 ⊆ 𝑼∗, then $H_{\mathcal{D}}(X \mid \mathbf{U}) \ge H_{\mathcal{D}}(X \mid \mathbf{U}^*)$
§ Adding more parents to a variable never increases its entropy term
§ Therefore, it never decreases the log-likelihood of the resulting structure
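A quick empirical check of this monotonicity on the running dataset (an illustrative sketch with assumed helper names): conditioning 𝐴 on {𝐶, 𝐷} rather than on {𝐶} does not raise the conditional entropy, and here in fact drops it to zero.

```python
import math
from collections import Counter

DATA = [  # the running five-case dataset (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
N = len(DATA)

def cond_entropy(x, parents):
    """Empirical conditional entropy H_D(X | parents) in bits."""
    joint = Counter((row[x], tuple(row[p] for p in parents)) for row in DATA)
    par = Counter(tuple(row[p] for p in parents) for row in DATA)
    return sum(-(n / N) * math.log2(n / par[u]) for (_, u), n in joint.items())

print(round(cond_entropy("A", ["C"]), 3))       # 0.4
print(round(cond_entropy("A", ["C", "D"]), 3))  # 0.0 -- more parents, entropy no higher
```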

Learning DAG Structures and Overfitting
§ Therefore, if a DAG 𝐺∗ is the result of adding edges to a DAG 𝐺, then $LL(G^*; \mathcal{D}) \ge LL(G; \mathcal{D})$
§ If we simply search for a network structure with maximal likelihood
§ We end up choosing a complete network structure
§ That is, a DAG to which no more edges can be added without introducing cycles
§ Complete DAGs are undesirable for several reasons
§ They make no assertions of conditional independence
§ Their topology does not reveal any properties of the distribution they induce
§ A complete DAG over 𝑛 variables has treewidth 𝑛 − 1
§ Complete DAGs suffer from the problem of overfitting: the use of a model that has too many parameters compared to the available data

Overfitting
§ In summary, the problem of overfitting occurs when
§ We focus on learning a model that fits the data well
§ Without sufficiently constraining the number of free model parameters
§ The result is that we end up adopting models that are more complex than necessary
§ Such models tend to have poor generalization performance
§ That is, they perform poorly on cases that are not part of the dataset
§ There is no agreed-upon solution for overfitting
§ However, the available solutions are based on the principle of Occam's razor
§ Which states that we should prefer simpler models, other things being equal

Model Complexity: DAG Dimension
§ To realize Occam's razor principle, we need
§ A measure of model complexity
§ A method for balancing model complexity and data fit
§ A common choice of complexity measure is the number of independent parameters in the model
§ This gives us the following measure of complexity
§ Let 𝐺 be a DAG over variables 𝑋₁, …, 𝑋ₙ with corresponding parents 𝑼₁, …, 𝑼ₙ, and let |𝒀| denote the number of instantiations of a set of variables 𝒀
§ The dimension of 𝐺 is defined as

$$|G| \triangleq \sum_{i=1}^{n} |X_i \mathbf{U}_i|, \qquad\text{where}\qquad |X_i \mathbf{U}_i| \triangleq (|X_i| - 1)\,|\mathbf{U}_i|$$

§ Therefore, the dimension of a DAG is equal to the number of independent parameters in its CPTs
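The dimension is easy to compute directly from the parent sets and variable cardinalities. The following sketch (hypothetical helper names, binary variables as in the running example) reproduces the totals on the following slide:

```python
from math import prod

CARD = {"A": 2, "B": 2, "C": 2, "D": 2}   # number of values of each variable

def family_dimension(x, parents):
    """|X U| = (|X| - 1) * |U|, where |U| is the number of parent instantiations."""
    return (CARD[x] - 1) * prod(CARD[p] for p in parents)

def dag_dimension(dag):
    """|G| = sum of |X U| over the families of the DAG."""
    return sum(family_dimension(x, ps) for x, ps in dag.items())

tree     = {"B": (), "C": ("B",), "D": ("B",), "A": ("C",)}
tree_da  = {"B": (), "C": ("B",), "D": ("B",), "A": ("C", "D")}
complete = {"B": (), "D": ("B",), "C": ("B", "D"), "A": ("B", "C", "D")}

print(dag_dimension(tree), dag_dimension(tree_da), dag_dimension(complete))  # 7 9 15
```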


Model Complexity: DAG Dimension
§ For example, the dimensions of the following DAGs over the binary variables 𝐴, 𝐵, 𝐶 and 𝐷 are
§ The tree with families 𝐵, 𝐶𝐵, 𝐷𝐵 and 𝐴𝐶: |𝐵| = 1, |𝐶𝐵| = 2, |𝐷𝐵| = 2, |𝐴𝐶| = 2. Total = 7
§ The tree plus the edge 𝐷 → 𝐴 (the family of 𝐴 becomes 𝐴𝐶𝐷): |𝐵| = 1, |𝐶𝐵| = 2, |𝐷𝐵| = 2, |𝐴𝐶𝐷| = 4. Total = 9
§ A complete DAG: |𝐵| = 1, |𝐶𝐵𝐷| = 4, |𝐷𝐵| = 2, |𝐴𝐵𝐶𝐷| = 8. Total = 15
§ Using this notion of model complexity, we can define the following class of scoring measures

$$Score(G; \mathcal{D}) \triangleq LL(G; \mathcal{D}) - \psi(N)\,|G|$$

§ The first component is the log-likelihood of the structure 𝐺
§ The second component, 𝜓(𝑁)·|𝐺|, is a penalty term that favours simple models
§ The penalty term has a weight 𝜓(𝑁) ≥ 0 that is a function of the dataset size 𝑁

Scoring Measures
§ When the penalty weight 𝜓(𝑁) is a constant independent of 𝑁
§ We obtain a score in which model complexity is a secondary issue
§ The log-likelihood function 𝐿𝐿(𝐺; 𝒟) grows linearly in the dataset size 𝑁
§ Therefore, the log-likelihood will quickly dominate the penalty term
§ Model complexity will then only be used to distinguish between models that have relatively equal log-likelihood terms
§ This scoring measure is known as the Akaike information criterion (AIC)
§ A more common choice is 𝜓(𝑁) = (log₂ 𝑁)/2, which leads to a more influential penalty term
§ However, this term grows only logarithmically in 𝑁, while the log-likelihood term grows linearly
§ The influence of model complexity therefore decreases as 𝑁 grows, allowing the log-likelihood to eventually dominate
§ This penalty weight gives rise to the minimum description length (MDL) score

Minimum Description Length (MDL)
§ Minimum Description Length (MDL) score:

$$MDL(G; \mathcal{D}) \triangleq LL(G; \mathcal{D}) - \frac{\log_2 N}{2}\,|G|$$

§ For example, the tree network on the top has the following MDL score

$$MDL = -12.1 - \frac{\log_2 5}{2}\cdot 7 = -12.1 - 8.1$$

§ The network on the bottom (the tree plus the edge 𝐷 → 𝐴) has the following MDL score

$$MDL = -10.1 - \frac{\log_2 5}{2}\cdot 9 = -10.1 - 10.4$$

Minimum Description Length (MDL)
§ Minimum Description Length (MDL) score:

$$MDL(G; \mathcal{D}) \triangleq LL(G; \mathcal{D}) - \frac{\log_2 N}{2}\,|G|$$

§ The network on the top has MDL score −12.1 − 8.1 = −20.2
§ The network on the bottom has MDL score −10.1 − 10.4 = −20.5
§ Therefore, the MDL score prefers the top network even though it has a smaller log-likelihood
§ The MDL score is also known as the Bayesian information criterion (BIC)
§ It is sometimes expressed as the negative of the score above, in which case the goal becomes to minimize the score instead of maximizing it
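Putting the pieces together for the two networks above, a few lines of Python (an illustrative check; the log-likelihood and dimension values are taken from the slides) reproduce the comparison:

```python
import math

N = 5                                  # dataset size
psi = math.log2(N) / 2                 # MDL penalty weight

def mdl(log_likelihood, dimension):
    """MDL(G; D) = LL(G; D) - (log2 N / 2) * |G|."""
    return log_likelihood - psi * dimension

print(round(mdl(-12.1, 7), 1))  # -20.2  tree
print(round(mdl(-10.1, 9), 1))  # -20.5  tree plus the edge D -> A
```

So the penalized score reverses the preference that the raw likelihood gave.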

Searching for Network Structure
§ Searching for a network structure can be quite expensive
§ Due to the very large number of structures we need to consider
§ Greedy algorithms tend to be more practical when learning structures § Systematic search algorithms can be practical under some conditions
§ Both classes of algorithms rely for their efficiency on a property of scoring functions
§ Most scoring functions are decomposable or modular
§ They allow us to decompose the score into an aggregate of local scores, one for each network family

Searching for Network Structure
§ For instance, the class of scores

$$Score(G; \mathcal{D}) \triangleq LL(G; \mathcal{D}) - \psi(N)\,|G|$$

§ can be decomposed as

$$Score(G; \mathcal{D}) = \sum_{X\mathbf{U}} Score(X, \mathbf{U}; \mathcal{D}), \qquad Score(X, \mathbf{U}; \mathcal{D}) \triangleq -N\,H_{\mathcal{D}}(X \mid \mathbf{U}) - \psi(N)\,|X\mathbf{U}|$$

§ Note how the score is a sum of local scores, one for each family 𝑋𝑼
§ The contribution of each family is split into two parts
§ One resulting from the decomposition of the log-likelihood component of the score
§ The other resulting from the decomposition of the penalty component
§ This decomposition enables several efficient implementations of heuristic and optimal search algorithms
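Under the MDL penalty, each family score can be computed on its own and the global score recovered as their sum. A small sketch on the running example (assumed dataset encoding and helper names; the conditional entropies match the values quoted earlier):

```python
import math
from collections import Counter

DATA = [  # the running five-case dataset (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
N = len(DATA)
PSI = math.log2(N) / 2                 # MDL penalty weight

def cond_entropy(x, parents):
    joint = Counter((row[x], tuple(row[p] for p in parents)) for row in DATA)
    par = Counter(tuple(row[p] for p in parents) for row in DATA)
    return sum(-(n / N) * math.log2(n / par[u]) for (_, u), n in joint.items())

def family_score(x, parents):
    """Score(X, U; D) = -N * H_D(X | U) - psi(N) * |X U|  (binary variables assumed)."""
    dim = (2 - 1) * 2 ** len(parents)
    return -N * cond_entropy(x, parents) - PSI * dim

tree = {"B": (), "C": ("B",), "D": ("B",), "A": ("C",)}
total = sum(family_score(x, ps) for x, ps in tree.items())
print(round(total, 1))  # about -20.2, the MDL score of the tree computed above
```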

Local Search
§ We can search for network structure by starting with some initial structure
§ Then modify it locally to increase its score
§ The initial structure can be chosen randomly, based on some prior
knowledge, or can be a tree structure as discussed previously
§ The local modifications to the structure are constrained to
§ Adding an edge
§ Removing an edge
§ Reversing an edge while ensuring the structure remains a DAG
§ These local changes to the network structure also change the score
§ Possibly increasing or decreasing it
§ However, the goal is to commit to the change that increases the score the most
§ If none of the local changes can increase the score, the algorithm terminates and returns the current structure
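The following Python sketch illustrates this greedy local search on the running example. It is an illustration only (the dataset, helper names and the choice of the MDL score are assumptions carried over from the earlier snippets), not the course's reference implementation.

```python
import math
from collections import Counter
from itertools import combinations

DATA = [  # the running five-case dataset (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
VARS = ["A", "B", "C", "D"]
N = len(DATA)
PSI = math.log2(N) / 2                 # MDL penalty weight

def cond_entropy(x, parents):
    joint = Counter((row[x], tuple(row[p] for p in parents)) for row in DATA)
    par = Counter(tuple(row[p] for p in parents) for row in DATA)
    return sum(-(n / N) * math.log2(n / par[u]) for (_, u), n in joint.items())

def score(dag):
    """MDL score, summed over families (binary variables assumed)."""
    return sum(-N * cond_entropy(x, ps) - PSI * 2 ** len(ps) for x, ps in dag.items())

def is_acyclic(dag):
    visiting, done = set(), set()
    def dfs(v):
        if v in visiting:
            return False               # back edge: cycle found
        if v in done:
            return True
        visiting.add(v)
        ok = all(dfs(p) for p in dag[v])
        visiting.discard(v)
        done.add(v)
        return ok
    return all(dfs(v) for v in VARS)

def neighbours(dag):
    """All DAGs reachable by adding, removing or reversing a single edge."""
    for x, y in combinations(VARS, 2):
        for u, v in ((x, y), (y, x)):
            if u in dag[v]:                          # edge u -> v exists
                removed = {**dag, v: tuple(p for p in dag[v] if p != u)}
                yield removed                        # removal never creates a cycle
                flipped = {**removed, u: removed[u] + (v,)}
                if is_acyclic(flipped):
                    yield flipped
            elif v not in dag[u]:                    # no edge between u and v yet
                added = {**dag, v: dag[v] + (u,)}
                if is_acyclic(added):
                    yield added

def local_search(start):
    current, best = start, score(start)
    while True:
        candidate = max(neighbours(current), key=score, default=None)
        if candidate is None or score(candidate) <= best:
            return current, best                     # local optimum reached
        current, best = candidate, score(candidate)

empty = {x: () for x in VARS}
dag, s = local_search(empty)
print(dag, round(s, 1))
```

Starting from a Chow-Liu tree instead of the empty graph, as the slide suggests, only changes the `start` argument.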

Local Search
§ Some observations about this local search algorithm
§ It is not guaranteed to return an optimal network structure
§ The only guarantee provided by the algorithm is that the returned structure is locally optimal
§ That is, no local change can improve its score
§ This suboptimal behaviour can be improved by techniques such as random restarts
§ We repeat the local search multiple times, each time starting with a different network
§ Return the network with the best score across all repetitions

Local Search
§ Another observation relates to score updating
§ Consider a network 𝐺 and the network 𝐺∗ obtained from it by removing the edge 𝐴 → 𝐵
§ Since this change affects only the family of node 𝐵, we have

$$Score(G^*; \mathcal{D}) = Score(G; \mathcal{D}) - Score(B, A; \mathcal{D}) + Score(B; \mathcal{D})$$

§ That is, we discount the contribution of 𝐵's old family and add the contribution of 𝐵's new family
§ More generally, adding or removing one edge changes only one family, while reversing an edge changes only two families
§ The score can always be updated locally as a result of the local network change
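For instance, with the MDL score on the running dataset (an illustrative calculation using conditional entropies computed from the five cases, not taken from the slides), removing the edge 𝐴 → 𝐵 changes only 𝐵's family score:

$$Score(B, A; \mathcal{D}) = -5(.649) - \tfrac{\log_2 5}{2}\cdot 2 \approx -5.57, \qquad Score(B; \mathcal{D}) = -5(.722) - \tfrac{\log_2 5}{2}\cdot 1 \approx -4.77$$

so the total score changes by $Score(B; \mathcal{D}) - Score(B, A; \mathcal{D}) \approx +0.8$, regardless of what the rest of the network looks like.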

Constraining the Search Space
§ A technique for reducing the search space size is to assume a total ordering on variables
§ Search only among network structures that are consistent with the chosen order
§ For instance, suppose we use the variable order 𝑋₁, …, 𝑋ₙ
§ The search process then tries to find, for each variable 𝑋ᵢ, a set of parents 𝑼ᵢ ⊆ {𝑋₁, …, 𝑋ᵢ₋₁}
§ This technique also allows us to decompose the search into 𝑛 independent problems
§ Each concerned with finding a set of parents for some network variable
§ That is, the search problem reduces to considering each variable 𝑋ᵢ independently
§ And then finding a set of parents 𝑼ᵢ ⊆ {𝑋₁, …, 𝑋ᵢ₋₁} that maximizes 𝑆𝑐𝑜𝑟𝑒(𝑋ᵢ, 𝑼ᵢ; 𝒟)
§ We discuss greedy and optimal methods for maximizing these local scores

Greedy Search
§ A well-known heuristic algorithm for optimizing a family score is known as K3
§ It starts with an empty set of parents
§ It then successively adds variables to the set, one at a time
§ Until such additions no longer increase the score
§ For example, suppose we want to find a set of parents for 𝑋₅ from 𝑋₁, …, 𝑋₄
§ We start with 𝑼₅ = ∅ and try to find a variable 𝑋ᵢ that maximizes 𝑆𝑐𝑜𝑟𝑒(𝑋₅, 𝑋ᵢ; 𝒟), provided that

$$Score(X_5, X_i; \mathcal{D}) \ge Score(X_5; \mathcal{D})$$

§ Suppose 𝑋₃ is such a variable; then 𝑼₅ = {𝑋₃} and we search for another variable 𝑋ᵢ that maximizes 𝑆𝑐𝑜𝑟𝑒(𝑋₅, 𝑋₃, 𝑋ᵢ; 𝒟), provided that

$$Score(X_5, X_3, X_i; \mathcal{D}) \ge Score(X_5, X_3; \mathcal{D})$$

§ Suppose 𝑋₂ happens to be such a variable; then 𝑼₅ = {𝑋₂, 𝑋₃}
§ If no other variable can improve the score, K3 terminates with 𝑼₅ = {𝑋₂, 𝑋₃} as the parent set for 𝑋₅
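A sketch of K3 in Python (illustrative; the family score is the MDL family score used in the earlier snippets, and the dataset and target variable are assumptions):

```python
import math
from collections import Counter

DATA = [  # the running five-case dataset (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
N = len(DATA)
PSI = math.log2(N) / 2

def family_score(x, parents):
    """MDL family score: -N * H_D(X | U) - psi(N) * (|X| - 1)|U|, binary variables."""
    joint = Counter((row[x], tuple(row[p] for p in parents)) for row in DATA)
    par = Counter(tuple(row[p] for p in parents) for row in DATA)
    h = sum(-(n / N) * math.log2(n / par[u]) for (_, u), n in joint.items())
    return -N * h - PSI * 2 ** len(parents)

def k3(x, candidates):
    """Greedily grow the parent set of x while the family score improves."""
    parents, best = (), family_score(x, ())
    remaining = list(candidates)
    while remaining:
        gains = [(family_score(x, parents + (c,)), c) for c in remaining]
        top_score, top_var = max(gains)
        if top_score <= best:          # no candidate increases the score: stop
            break
        parents, best = parents + (top_var,), top_score
        remaining.remove(top_var)
    return parents, best

print(k3("D", ["A", "B", "C"]))  # e.g. picks B as the single parent of D (score about -5.57)
```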

Greedy Search
§ K3 is not guaranteed to identify the optimal set of parents
§ That is, the one that maximizes 𝑆𝑐𝑜𝑟𝑒(𝑋ᵢ, 𝑼ᵢ; 𝒟)
§ Therefore, it is common to use the output of this algorithm as a starting point for other algorithms
§ Such as the local search algorithm discussed previously § Or the optimal search algorithm we discuss next
(Figure: the K3 example, selecting parents for 𝑋₅ among the candidates 𝑋₁, …, 𝑋₄.)

Optimal Search
§ We discuss an optimal search algorithm for network structures
§ It is based on branch-and-bound depth-first search
§ Like K3, it assumes a total order of the variables 𝑋₁, …, 𝑋ₙ
§ The search is restricted to structures consistent with this order
§ This allows us to decompose the search into 𝑛 independent search problems
§ For instance, consider the search for the parents of 𝑋₅
§ The first level of the search tree represents the empty set of parents, 𝑼₅ = { }
§ Each additional level corresponds to adding a single variable to each parent set, while avoiding duplicate sets

(Figure: the search tree over candidate parent sets for 𝑋₅, from { } through the singletons {𝑋₁}, …, {𝑋₄}, pairs such as {𝑋₁, 𝑋₂} and {𝑋₃, 𝑋₄}, triples such as {𝑋₁, 𝑋₂, 𝑋₃} and {𝑋₂, 𝑋₃, 𝑋₄}, up to {𝑋₁, 𝑋₂, 𝑋₃, 𝑋₄}.)

§ The search tree for variable 𝑋ᵢ has $2^{i-1}$ nodes, corresponding to the number of subsets we can choose from 𝑋₁, …, 𝑋ᵢ₋₁

Optimal Search: Pruning
§ We can search this tree using depth-first search
§ Maintaining the score 𝑠 of the best parent set found so far
§ Subtrees whose parent sets cannot score better than 𝑠 can then be pruned
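One possible way to realise the pruning is sketched below (an illustration under assumptions; the exact bound used in the lecture is not spelled out here). Since $-N\,H_{\mathcal{D}}(X \mid \mathbf{U}) \le 0$, every parent set in a subtree scores at most the negative of its penalty term, which gives a simple admissible bound for skipping hopeless subtrees.

```python
import math
from collections import Counter

DATA = [  # the running five-case dataset (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 1},
]
N = len(DATA)
PSI = math.log2(N) / 2

def family_score(x, parents):
    joint = Counter((row[x], tuple(row[p] for p in parents)) for row in DATA)
    par = Counter(tuple(row[p] for p in parents) for row in DATA)
    h = sum(-(n / N) * math.log2(n / par[u]) for (_, u), n in joint.items())
    return -N * h - PSI * 2 ** len(parents)      # binary variables assumed

def best_parents(x, candidates):
    """Depth-first branch-and-bound over all subsets of the candidate parents."""
    best = {"set": (), "score": family_score(x, ())}

    def dfs(parents, rest):
        for i, c in enumerate(rest):
            child = parents + (c,)
            s = family_score(x, child)
            if s > best["score"]:
                best["set"], best["score"] = child, s
            # Any deeper (larger) parent set scores at most -psi * 2^(len(child)+1),
            # because the entropy part is never positive; prune if that cannot beat s.
            if -PSI * 2 ** (len(child) + 1) > best["score"]:
                dfs(child, rest[i + 1:])

    dfs((), tuple(candidates))
    return best["set"], best["score"]

print(best_parents("A", ["B", "C", "D"]))  # e.g. C as the single best parent of A
```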
