COMP9418: Advanced Topics in Statistical Machine Learning
Learning Bayesian Network Parameters with Maximum Likelihood
Instructor: University of Wales
Introduction
§ Consider this Bayesian network structure and dataset
§ Each row in the dataset is called a case and represents a medical record for a patient
§ Some cases are incomplete, where “?” indicates unavailability
§ Therefore, the dataset is said to be incomplete due to these missing values
§ Otherwise it is called complete
§ The objective of this lecture is to provide techniques for estimating parameters of a network structure from data
§ Given both complete and incomplete datasets
[Figure: Bayesian network with nodes Tonsillitis and Sore Throat, shown with the corresponding patient dataset]
Introduction
§ We can construct a network structure by
§ Design information
§ Working with domain experts
§ In this lecture, we discuss techniques to estimate the CPTs from data
§ Also, we will discuss techniques for learning the network structure itself
§ Although we focus on complete datasets for this subtask
§ The next slides list some possible learning tasks
[Figure: a Bayesian network can be obtained from the true distribution (via data) or from a domain expert (via elicitation)]
Known Structure, Complete Data
§ This is the simplest setting
§ Given a network that factorizes 𝑃∗
§ Dataset with IID samples from 𝑃∗
§ We need to output the CPTs
[Example: network over variables 𝐶, 𝑆, 𝑅, 𝑊 with a complete dataset of cases such as (c, s̄, r, w), and CPTs such as 𝑃(𝐶) and 𝑃(𝑅|𝐶) to be estimated]
Known Structure, Incomplete Data
§ Incomplete data complicates the problem considerably
§ Given a network that factorizes 𝑃∗
§ Dataset with IID samples from 𝑃∗ with unknown values
§ We need to output the CPTs
[Example: same network, but the dataset cases contain missing values marked "?", e.g. (c, ?, r, w)]
Known Structure, Latent Variables
§ Latent variables are not recorded in data
§ Given a network that factorizes 𝑃∗
§ Dataset with IID samples from 𝑃∗ with unknown values and latent variables
§ We need to output the CPTs
[Example: network extended with a latent variable 𝐻 that is never recorded in the data; cases such as (c, ?, r, w, ?) always have the value of 𝐻 missing]
Unknown Structure, Complete Data
§ We may also want to learn the network structure
§ Given a set of random variables
§ Dataset with IID samples from 𝑃∗
§ We need to output the edge connectivity and the CPTs
[Example: complete dataset over 𝐶, 𝑆, 𝑅, 𝑊; both the edges and the CPTs, such as 𝑃(𝐶) and 𝑃(𝑅|𝐶), must be learned]
Unknown Structure, Incomplete Data
§ A challenging scenario
§ Given a set of random variables
§ Dataset with IID samples from 𝑃∗ with unknown values
§ We need to output the edge connectivity and the CPTs
[Example: incomplete dataset over 𝐶, 𝑆, 𝑅, 𝑊 with missing values marked "?"]
Estimating Parameters from Complete Data
[Dataset: 16 cases over 𝐻, 𝑆, 𝐸, e.g. (h, s̄, e), (h, s, e), (h, s̄, e), …]
§ Consider this simple network
§ Our goal is to estimate its parameters from the dataset
§ Our assumptions are
§ These cases are generated independently
§ According to their true probabilities
§ Under these assumptions
§ We can define an empirical distribution 𝑃𝒟
§ According to this distribution, the empirical probability of an instantiation is simply its frequency of occurrence
[Network: Health Aware (𝐻) with children Smokes (𝑆) and Exercises (𝐸)]

𝐻   𝑆   𝐸    𝑃𝒟(·)
h    s    e    2/16
h    s    ē    0/16
h    s̄    e    9/16
h    s̄    ē    1/16
h̄    s    e    0/16
h̄    s    ē    1/16
h̄    s̄    e    2/16
h̄    s̄    ē    1/16
Estimating Parameters from Complete Data
[Dataset of 16 cases over 𝐻, 𝑆, 𝐸 repeated from the previous slide]
§ Empirical distribution 𝑃𝒟

  𝑃𝒟(h, s, e) = 𝒟#(h, s, e) / 𝑁

§ 𝒟#(h, s, e) is the number of cases in dataset 𝒟 that satisfy instantiation h, s, e
§ 𝑁 is the dataset size
[Network figure and empirical distribution table repeated from the previous slide]
Estimating Parameters from Complete Data
§ We can now estimate parameters based on the empirical distribution
§ For example, the parameter 𝜃_{s|h} corresponds to 𝑃𝒟(s|h)
§ The probability a person will smoke given they are health-aware

  𝑃𝒟(s|h) = 𝑃𝒟(s, h) / 𝑃𝒟(h) = (2/16) / (12/16) = 1/6

[Network figure and empirical distribution table repeated from the previous slides]
Empirical Distribution: Definition
§ A dataset 𝒟 for variables 𝑿 is a vector 𝒅₁, …, 𝒅_N where each 𝒅ᵢ is called a case and represents a partial instantiation of variables 𝑿
§ The dataset is complete if each case is a complete instantiation of variables 𝑿
§ Otherwise, the dataset is incomplete
§ The empirical distribution for a complete dataset 𝒟 is defined as

  𝑃𝒟(𝛼) ≝ 𝒟#(𝛼) / 𝑁

§ 𝒟#(𝛼) is the number of cases 𝒅ᵢ in the dataset 𝒟 that satisfy the event 𝛼
Complete Data Parameter Estimation: Definition
§ We can estimate the parameter 𝜃_{x|𝒖} by the empirical probability

  𝜃^ml_{x|𝒖} ≝ 𝑃𝒟(x|𝒖) = 𝒟#(x, 𝒖) / 𝒟#(𝒖)

§ More generally, any function of the data is called a statistic
§ A sufficient statistic is a statistic that contains all the information in the data
§ The count 𝒟#(x, 𝒖) is a sufficient statistic for this particular estimation task
§ Considering the network structure (Health Aware (𝐻) with children Smokes (𝑆) and Exercises (𝐸)) and the corresponding dataset
§ We have the following parameter estimates

  𝐻    𝜃^ml_𝐻
  h     3/4
  h̄     1/4

  𝐻   𝑆    𝜃^ml_{𝑆|𝐻}
  h    s     1/6
  h    s̄     5/6
  h̄    s     1/4
  h̄    s̄     3/4

  𝐻   𝐸    𝜃^ml_{𝐸|𝐻}
  h    e     11/12
  h    ē     1/12
  h̄    e     1/2
  h̄    ē     1/2
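§ To make the counting concrete, here is a small Python sketch (not part of the lecture material) that recovers these estimates. Since the slides show the empirical counts rather than the individual cases, the dataset below is rebuilt from those counts, which is an assumption about the underlying records.

```python
from fractions import Fraction

# Dataset rebuilt from the empirical counts on the slides
# (2, 0, 9, 1, 0, 1, 2, 1 out of N = 16 over H, S, E; '~' marks negation).
counts = {('h', 's', 'e'): 2, ('h', 's', '~e'): 0,
          ('h', '~s', 'e'): 9, ('h', '~s', '~e'): 1,
          ('~h', 's', 'e'): 0, ('~h', 's', '~e'): 1,
          ('~h', '~s', 'e'): 2, ('~h', '~s', '~e'): 1}
data = [case for case, n in counts.items() for _ in range(n)]
N = len(data)                     # 16
IDX = {'H': 0, 'S': 1, 'E': 2}

def empirical(event):
    """P_D(event): fraction of cases satisfying a partial instantiation,
    given as a dict such as {'H': 'h', 'S': 's'}."""
    matches = sum(1 for case in data
                  if all(case[IDX[var]] == val for var, val in event.items()))
    return Fraction(matches, N)

def mle(x_var, x_val, parents):
    """theta^ml_{x|u} = D#(x, u) / D#(u) for the family of x_var."""
    return empirical({x_var: x_val, **parents}) / empirical(parents)

print(mle('H', 'h', {}))           # 3/4
print(mle('S', 's', {'H': 'h'}))   # 1/6
print(mle('E', 'e', {'H': 'h'}))   # 11/12
print(mle('S', 's', {'H': '~h'}))  # 1/4
```

§ Only the counts 𝒟#(x, 𝒖) enter the computation, which is exactly what makes them sufficient statistics for this task.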
Complete Data Parameter Estimation: Definition
§ We expect the variance of 𝜃^ml_{x|𝒖} to decrease as the dataset increases in size
§ If the dataset is an IID sample of a distribution 𝑃
§ The Central Limit Theorem tells us 𝜃^ml_{x|𝒖} is asymptotically Normal
§ It can be approximated by a Normal distribution with
§ Mean 𝑃(x|𝒖)
§ Variance

  𝑃(x|𝒖)(1 − 𝑃(x|𝒖)) / (𝑁 𝑃(𝒖))

§ The variance depends on 𝑁, 𝑃(𝒖) and 𝑃(x|𝒖)
§ It is very sensitive to 𝑃(𝒖): it is difficult to estimate this parameter when this probability is small
§ A small 𝑃(𝒖) and a not large enough 𝑁 lead to the problem of zero counts
§ We have seen this problem before in the Naïve Bayes lecture and will return to it when we discuss Bayesian learning
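§ As a quick illustration of this formula (a rough calculation, using the empirical values from the earlier 𝐻, 𝑆, 𝐸 example as stand-ins for the true 𝑃): for 𝜃_{s|h} with 𝑁 = 16, 𝑃(h) = 12/16 and 𝑃(s|h) = 1/6, the approximate variance is (1/6)(5/6)/(16 · 12/16) = 5/432 ≈ 0.012, a standard deviation of about 0.11, which is large relative to the estimate 1/6 itself; this is why a small 𝑃(𝒖) or a small 𝑁 makes such estimates unreliable.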
Maximum Likelihood (ML) Estimates
§ Let 𝜃 be the set of all parameter estimates for a network 𝐺
§ Let 𝑃_𝜃 be the probability distribution induced by 𝐺 and 𝜃
§ We define the likelihood of these estimates as

  𝐿(𝜃; 𝒟) ≝ ∏_{i=1}^{N} 𝑃_𝜃(𝒅ᵢ)

§ That is, the likelihood of estimates 𝜃 is the probability of observing the dataset 𝒟 under these estimates
§ We can show that, given a complete dataset 𝒟, the parameters 𝜃^ml_{x|𝒖} are the only estimates that maximize the likelihood

  𝜃* = argmax_𝜃 𝐿(𝜃; 𝒟)   iff   𝜃*_{x|𝒖} = 𝑃𝒟(x|𝒖)

§ For this reason, these estimates are called maximum likelihood (ML) estimates
§ They are denoted by 𝜃^ml, so

  𝜃^ml = argmax_𝜃 𝐿(𝜃; 𝒟)
ML Estimates and KL Divergence
§ ML estimates also minimize the KL divergence between the learned Bayesian network and the empirical distribution
§ For a complete dataset 𝒟 and variables 𝑿
§ For a complete dataset 𝒟 and variables 𝑿

  argmax_𝜃 𝐿(𝜃; 𝒟) = argmin_𝜃 KL(𝑃𝒟(𝑿), 𝑃_𝜃(𝑿))

§ ML estimates are unique for a given structure 𝐺 and complete dataset 𝒟
§ Therefore, the likelihood of these parameters is a function of 𝐺 and 𝒟
§ We define the likelihood of structure 𝐺 given 𝒟 as

  𝐿(𝐺; 𝒟) ≝ 𝐿(𝜃^ml; 𝒟)

§ Where 𝜃^ml are the ML estimates for structure 𝐺 and dataset 𝒟
Log-Likelihood
§ Often, it is more convenient to work with the logarithm of the likelihood function

  𝐿𝐿(𝜃; 𝒟) ≝ log 𝐿(𝜃; 𝒟) = ∑_{i=1}^{N} log 𝑃_𝜃(𝒅ᵢ)

§ The log-likelihood of structure 𝐺 is defined similarly

  𝐿𝐿(𝐺; 𝒟) ≝ log 𝐿(𝐺; 𝒟)

§ Maximizing the log-likelihood is equivalent to maximizing the likelihood function
§ Although the likelihood is ≥ 0 while the log-likelihood is ≤ 0
§ We use log base 2 for the log-likelihood but suppress the base 2 in the notation
Log-Likelihood
§ A key property of the log-likelihood function is that it decomposes into several components
§ One for each family in the Bayesian network structure
§ Let 𝐺 be a structure and 𝒟 a complete dataset of size 𝑁. If 𝑋𝑼 ranges over the families of structure 𝐺, then

  𝐿𝐿(𝐺; 𝒟) = −𝑁 ∑_{𝑋𝑼} 𝐻𝒟(𝑋|𝑼)

§ Where 𝐻𝒟(𝑋|𝑼) is the conditional entropy, defined as

  𝐻𝒟(𝑋|𝑼) = −∑_{x,𝒖} 𝑃𝒟(x, 𝒖) log₂ 𝑃𝒟(x|𝒖)
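§ The decomposition can be checked numerically. The sketch below (not from the slides) evaluates the log-likelihood of the ML parameters for the 𝐻, 𝑆, 𝐸 example in two ways: directly, summing log₂ 𝑃_𝜃(𝒅ᵢ) with the chain rule, and through the family entropies; as before, the dataset is rebuilt from the empirical counts, which is an assumption about the underlying cases.

```python
import math
from itertools import product

# Dataset rebuilt from the empirical counts of the H, S, E example.
counts = {('h', 's', 'e'): 2, ('h', 's', '~e'): 0, ('h', '~s', 'e'): 9, ('h', '~s', '~e'): 1,
          ('~h', 's', 'e'): 0, ('~h', 's', '~e'): 1, ('~h', '~s', 'e'): 2, ('~h', '~s', '~e'): 1}
data = [c for c, n in counts.items() for _ in range(n)]
N = len(data)
VALS = {0: ('h', '~h'), 1: ('s', '~s'), 2: ('e', '~e')}   # H, S, E
FAMILIES = [(0, ()), (1, (0,)), (2, (0,))]                # structure H -> S, H -> E

def p(event):
    """Empirical probability of a partial instantiation {var_index: value}."""
    return sum(all(c[i] == v for i, v in event.items()) for c in data) / N

# 1) Direct log-likelihood of the ML parameters: LL = sum_i log2 P(d_i),
#    with P(d_i) evaluated by the chain rule using the empirical CPTs.
def log2_case(case):
    total = 0.0
    for x, parents in FAMILIES:
        u = {i: case[i] for i in parents}
        total += math.log2(p({x: case[x], **u}) / (p(u) if parents else 1.0))
    return total

ll_direct = sum(log2_case(c) for c in data)

# 2) Family decomposition: LL(G; D) = -N * sum_{XU} H_D(X | U),
#    with H_D(X|U) = -sum_{x,u} P_D(x, u) log2 P_D(x | u).
def cond_entropy(x, parents):
    h = 0.0
    for combo in product(*(VALS[i] for i in (x,) + parents)):
        event = dict(zip((x,) + parents, combo))
        pxu = p(event)
        if pxu > 0:
            pu = p({i: event[i] for i in parents}) if parents else 1.0
            h -= pxu * math.log2(pxu / pu)
    return h

ll_decomp = -N * sum(cond_entropy(x, u) for x, u in FAMILIES)
print(round(ll_direct, 6), round(ll_decomp, 6))   # the two numbers agree
```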
Estimating Parameters from Incomplete Data
§ The parameter estimates considered so far have a number of interesting properties
§ They are unique, asymptotically Normal, and maximize the probability of the data
§ They are easy to compute with a single pass over the dataset
§ Given these properties, we could seek maximum likelihood estimates for incomplete data as well
§ However, the properties of these estimates will depend on the nature of incompleteness
Incomplete Data: Example
§ For example, consider the network structure on the right, 𝐶 → 𝑇
§ 𝐶 is a medical condition and 𝑇 a test for detecting this condition
§ Let us also suppose the true parameters are given by the tables

  𝐶      𝑃(𝐶)
  yes    .25
  no     .75

  𝐶     𝑇      𝑃(𝑇|𝐶)
  yes   +ve    .80
  yes   −ve    .20
  no    +ve    .40
  no    −ve    .60

§ Hence, we have 𝑃(+ve) = 𝑃(−ve) = .5
§ Consider now the following incomplete datasets

  [Datasets 𝒟₁, 𝒟₂, 𝒟₃: eight cases each over (𝐶, 𝑇). In 𝒟₁, 𝐶 is missing in every case and 𝑇 is observed (four +ve and four −ve). In 𝒟₂, 𝑇 is missing exactly in the cases with 𝐶 = no. In 𝒟₃, values of 𝐶 or 𝑇 are missing in various cases.]
Incomplete Data: Example
§ Let us consider the first dataset 𝒟#
§ The cases split equally between +ve and −ve values of 𝑇
§ We expect this to be true in the limit given the distribution of this data
§ We can show the ML estimates are not unique for this dataset
§ The ML estimates for 𝒟₁ are characterized by the following equation

  𝜃_{𝑇=+ve|𝐶=yes} 𝜃_{𝐶=yes} + 𝜃_{𝑇=+ve|𝐶=no} 𝜃_{𝐶=no} = 1/2

§ The true parameters satisfy this equation
§ But the following estimates do as well

  𝜃_{𝐶=yes} = 1,  𝜃_{𝑇=+ve|𝐶=yes} = 1/2,  with 𝜃_{𝑇=+ve|𝐶=no} taking any value

[True CPTs and dataset 𝒟₁ repeated from the previous slide]
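§ A quick numerical check of this claim (a sketch, assuming 𝒟₁ consists of four +ve and four −ve cases with 𝐶 missing, as described above): any parameter setting that makes 𝑃(+ve) = 1/2 attains the same, maximal likelihood.

```python
# Likelihood of D1 under the network C -> T. Each case observes only T,
# so P(T = +ve) is obtained by summing out C.
def likelihood_D1(theta_yes, theta_pos_given_yes, theta_pos_given_no):
    p_pos = theta_pos_given_yes * theta_yes + theta_pos_given_no * (1 - theta_yes)
    return (p_pos ** 4) * ((1 - p_pos) ** 4)   # four +ve and four -ve cases

print(likelihood_D1(0.25, 0.80, 0.40))  # true parameters       -> (1/2)^8 ~ 0.0039
print(likelihood_D1(1.00, 0.50, 0.99))  # alternative estimates -> (1/2)^8 ~ 0.0039
print(likelihood_D1(1.00, 0.50, 0.13))  # theta_{+ve|no} can take any value
```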
Incomplete Data: Example
§ Therefore, ML estimates are not unique for this dataset
§ This is not surprising since incomplete datasets may not contain enough information to pin down the true parameters
§ This nonuniqueness of ML estimates is not a desirable property
[True CPTs and dataset 𝒟₁ repeated from the previous slides]
Incomplete Data: Example
§ Consider now dataset 𝒟₂ to illustrate why data may be missing:
§ People who do not suffer from the condition tend to not take the test. That is, the
data is missing because the test is not performed
§ People who test negative tend not to report the result. That is, the test is performed but its value is not recorded
§ These two scenarios are different in a fundamental way
§ In the second scenario, the missing value provides evidence that its true value must be negative
§ ML estimates give the intended results for the first scenario but not for the second, as they do not integrate all the information available in the second scenario
§ However, we return to this topic later to show that ML can still be applied under the second scenario, but this requires some explication of the missing data mechanism
[Dataset 𝒟₂ repeated: 𝑇 is missing exactly in the cases with 𝐶 = no]
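§ The contrast between the two scenarios can also be seen in a small simulation (a sketch, not from the lecture; the sample size and random seed are arbitrary choices). We sample from the true model, hide the test result according to each mechanism, and re-estimate 𝑃(+ve | 𝐶 = yes) from the cases where 𝑇 is observed.

```python
import random
random.seed(0)

# True model from the slides: P(C=yes)=.25, P(+ve|yes)=.80, P(+ve|no)=.40.
N = 100_000                      # arbitrary sample size for the simulation
samples = []
for _ in range(N):
    c = 'yes' if random.random() < 0.25 else 'no'
    t = '+ve' if random.random() < (0.80 if c == 'yes' else 0.40) else '-ve'
    samples.append((c, t))

def est_pos_given_yes(cases):
    """Naive estimate of P(+ve | C=yes) from cases where T is observed."""
    obs = [t for c, t in cases if c == 'yes' and t is not None]
    return sum(t == '+ve' for t in obs) / len(obs)

# Scenario 1: the test is simply not performed when C = no.
scenario1 = [(c, t if c == 'yes' else None) for c, t in samples]
# Scenario 2: negative results are not reported (made deterministic here).
scenario2 = [(c, t if t == '+ve' else None) for c, t in samples]

print(est_pos_given_yes(scenario1))  # close to the true 0.80
print(est_pos_given_yes(scenario2))  # 1.0 -- badly biased
```

§ In the first scenario the missingness does not depend on the hidden value of 𝑇, so the naive estimate is still on target; in the second it does, and ignoring that mechanism biases the estimate, which is exactly the point made above.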
Expectation Maximization (EM)
§ Consider the Bayesian network on the right
§ Suppose our goal is to find ML estimates for the dataset 𝒟
§ We start with initial estimates 𝜃⁰ with the following likelihood

  𝐿(𝜃⁰; 𝒟) = ∏_{i=1}^{5} 𝑃_{𝜃⁰}(𝒅ᵢ)
           = 𝑃_{𝜃⁰}(b, c̄) 𝑃_{𝜃⁰}(b, d̄) 𝑃_{𝜃⁰}(b̄, c, d) 𝑃_{𝜃⁰}(b̄, c, d) 𝑃_{𝜃⁰}(b, d̄)
           = (.135)(.184)(.144)(.144)(.184) = 9.5×10⁻⁵
§ Evaluating the terms in this product generally requires inference on the Bayesian network
§ In contrast, in the complete data case each term can be evaluated using the chain rule for the Bayesian network
[Network 𝐴 → 𝐵, 𝐴 → 𝐶, 𝐵 → 𝐷 with initial parameters 𝜃⁰]

  𝐴    𝜃⁰_𝐴
  a     .20
  ā     .80

  𝐴   𝐵    𝜃⁰_{𝐵|𝐴}
  a    b     .75
  a    b̄     .25
  ā    b     .10
  ā    b̄     .90

  𝐴   𝐶    𝜃⁰_{𝐶|𝐴}
  a    c     .50
  a    c̄     .50
  ā    c     .25
  ā    c̄     .75

  𝐵   𝐷    𝜃⁰_{𝐷|𝐵}
Expectation Maximization (EM)
§ The expectation maximization (EM) algorithm is based on the complete data method
§ EM first completes the dataset, inducing an empirical distribution
§ Then it estimates parameters using ML
§ The new set of parameters is guaranteed to have likelihood no less than that of the initial parameters
§ This process is repeated until some convergence condition is met
§ For instance, the first case of dataset 𝒟 has variables 𝐴 and 𝐷 with missing values
§ There are four possible completions for this case
§ Although we do not know which one is correct, we can compute the probability of each completion based on the initial set of parameters
[Network 𝐴 → 𝐵, 𝐴 → 𝐶, 𝐵 → 𝐷 and initial CPTs 𝜃⁰ repeated from the previous slide]
Expected Empirical Distribution
§ This table lists, for each case 𝒅ᵢ
§ The probability of each completion, 𝑃_{𝜃⁰}(𝒄ᵢ|𝒅ᵢ)
§ Where 𝑪ᵢ are the variables with missing values in 𝒅ᵢ
§ The completed dataset defines an (expected) empirical distribution
§ The probability of an instantiation is computed considering all its occurrences in the completed dataset
§ However, instead of counting the number of occurrences, we add up the probabilities
§ For instance, there are 3 occurrences of instantiation a, b, c̄, d̄ in cases 𝒅₁, 𝒅₂ and 𝒅₅
  𝒟:  𝒅₁ = (?, b, c̄, ?),  𝒅₂ = (?, b, ?, d̄),  𝒅₃ = (?, b̄, c, d),  𝒅₄ = (?, b̄, c, d),  𝒅₅ = (?, b, ?, d̄)

  Completions 𝒄ᵢ and their probabilities 𝑃_{𝜃⁰}(𝒄ᵢ|𝒅ᵢ):
  𝒅₁:  (a, d) .111,  (a, d̄) .444,  (ā, d) and (ā, d̄) with the remaining probability
  𝒅₂:  (a, c) .326,  (a, c̄) .326,  (ā, c) and (ā, c̄) with the remaining probability
  𝒅₃:  (a) .122,  (ā) .878
  𝒅₄:  (a) .122,  (ā) .878
  𝒅₅:  (a, c) .326,  (a, c̄) .326,  (ā, c) and (ā, c̄) with the remaining probability
Expected Empirical Distribution
§ The probability, 𝑃_{𝒟,𝜃⁰}(a, b, c̄, d̄), of seeing these completions is

  (1/5)[𝑃_{𝜃⁰}(a, d̄|b, c̄) + 𝑃_{𝜃⁰}(a, c̄|b, d̄) + 𝑃_{𝜃⁰}(a, c̄|b, d̄)] = (.444 + .326 + .326)/5 ≈ .219
§ We can define the expected empirical distribution of dataset 𝒟 under parameters 𝜃ᵏ as

  𝑃_{𝒟,𝜃ᵏ}(𝛼) ≝ (1/𝑁) ∑_{𝒅ᵢ} ∑_{𝒄ᵢ: 𝒅ᵢ,𝒄ᵢ ⊨ 𝛼} 𝑃_{𝜃ᵏ}(𝒄ᵢ|𝒅ᵢ)

§ Where 𝛼 is an event and 𝑪ᵢ are the variables with missing values in case 𝒅ᵢ
§ 𝒅ᵢ, 𝒄ᵢ ⊨ 𝛼 means that event 𝛼 is satisfied by the completed case 𝒅ᵢ, 𝒄ᵢ

[Table of cases, completions and completion probabilities repeated from the previous slide]
Expected Empirical Distribution
  𝐴   𝐵   𝐶   𝐷     𝑃_{𝒟,𝜃⁰}(·)
  a    b    c    d̄      .130
  a    b    c̄    d      .022
  a    b    c̄    d̄      .219
  a    b̄    c    d      .049
  ā    b    c    d̄      .035
  ā    b    c̄    d      .018
  ā    b    c̄    d̄      .176
  ā    b̄    c    d      .351
  (all other instantiations have probability 0)
§ Given the definition of the expected empirical distribution, we can compute 𝑃_{𝒟,𝜃⁰} for all instantiations of variables 𝐴, 𝐵, 𝐶 and 𝐷
§ When the dataset is complete
§ 𝑃_{𝒟,𝜃ᵏ}(·) reduces to the empirical probability 𝑃𝒟(·), which is independent of the parameters 𝜃ᵏ
§ Moreover, 𝑁 𝑃_{𝒟,𝜃ᵏ}(𝒙) is called the expected count of instantiation 𝒙
§ We can use the expected empirical distribution to estimate parameters
§ Like we did for the complete data
§ For instance, for the parameter 𝜃_{c|ā}

  𝜃¹_{c|ā} = 𝑃_{𝒟,𝜃⁰}(c|ā) = 𝑃_{𝒟,𝜃⁰}(c, ā) / 𝑃_{𝒟,𝜃⁰}(ā) ≈ .666
Expectation Maximization (EM)
§ The figure on the right shows all parameter estimates based on 𝑃_{𝒟,𝜃⁰}, leading to new estimates 𝜃¹
§ The new estimates 𝜃¹ have the following likelihood for dataset 𝒟

  𝐿(𝜃¹; 𝒟) = ∏_{i=1}^{5} 𝑃_{𝜃¹}(𝒅ᵢ) = (.290)(.560)(.255)(.255)(.560) = 5.9×10⁻³ > 𝐿(𝜃⁰; 𝒟)

§ Therefore, we can define the EM estimates for a dataset 𝒟 and parameters 𝜃ᵏ as

  𝜃ᵏ⁺¹_{x|𝒖} ≝ 𝑃_{𝒟,𝜃ᵏ}(x|𝒖)

  𝐴    𝜃¹_𝐴
  a     .420
  ā     .580

  𝐴   𝐵    𝜃¹_{𝐵|𝐴}
  a    b     .883
  a    b̄     .117
  ā    b     .395
  ā    b̄     .605

  𝐴   𝐶    𝜃¹_{𝐶|𝐴}
  a    c     .426
  a    c̄     .574
  ā    c     .666
  ā    c̄     .334

  𝐵   𝐷    𝜃¹_{𝐷|𝐵}
  b    d     .067
  b    d̄     .933
  b̄    d     1
  b̄    d̄     0
Expectation Maximization (EM)
§ EM estimates can be computed without constructing the expected empirical distribution
§ The expected empirical probability of an event 𝛼 given dataset 𝒟 and parameters 𝜃ᵏ can be computed as

  𝑃_{𝒟,𝜃ᵏ}(𝛼) = (1/𝑁) ∑_{i=1}^{N} 𝑃_{𝜃ᵏ}(𝛼|𝒅ᵢ)

§ That is, we simply iterate over the dataset cases, computing the probability of 𝛼 given each case
§ The EM estimates can now be computed as

  𝜃ᵏ⁺¹_{x|𝒖} = ∑_{i=1}^{N} 𝑃_{𝜃ᵏ}(x, 𝒖|𝒅ᵢ) / ∑_{i=1}^{N} 𝑃_{𝜃ᵏ}(𝒖|𝒅ᵢ)

§ This equation computes EM estimates by performing inference on a Bayesian network parametrized by 𝜃ᵏ. For example,

  𝜃¹_{c|ā} = ∑_{i=1}^{5} 𝑃_{𝜃⁰}(c, ā|𝒅ᵢ) / ∑_{i=1}^{5} 𝑃_{𝜃⁰}(ā|𝒅ᵢ)
           = (0 + .087 + .878 + .878 + .087) / (.444 + .348 + .878 + .878 + .348) ≈ .666
EM: Algorithm
𝜃⁰ ← initial parameter values
𝑘 ← 0
while convergence criterion is not met do
    𝑐_{x𝒖} ← 0 for each family instantiation x𝒖
    for 𝑖 ← 1 to 𝑁 do
        for each family instantiation x𝒖 do
            𝑐_{x𝒖} ← 𝑐_{x𝒖} + 𝑃_{𝜃ᵏ}(x𝒖 | 𝒅ᵢ)      # requires inference on network (𝐺, 𝜃ᵏ)
    𝜃ᵏ⁺¹_{x|𝒖} ← 𝑐_{x𝒖} / ∑_{x*} 𝑐_{x*𝒖}
    𝑘 ← 𝑘 + 1
return 𝜃ᵏ

§ The stopping criterion usually employed is a small difference between 𝜃ᵏ and 𝜃ᵏ⁺¹ or a small change in the log-likelihood
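§ Below is a minimal runnable Python sketch of this algorithm for the 𝐴 → 𝐵, 𝐴 → 𝐶, 𝐵 → 𝐷 example (not the lecture's code). Inference is done by brute-force enumeration of the joint, standing in for the jointree computation of family marginals, so it is only feasible for tiny networks; the initial CPT for 𝐷 given 𝐵 is an assumption, since its values are not shown on the slides.

```python
import math
from itertools import product

VALUES = {'A': ('a', '~a'), 'B': ('b', '~b'), 'C': ('c', '~c'), 'D': ('d', '~d')}
PARENTS = {'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B',)}
ORDER = ('A', 'B', 'C', 'D')

theta = {  # theta[X][(u..., x)] = P(X = x | U = u)
    'A': {('a',): .20, ('~a',): .80},
    'B': {('a', 'b'): .75, ('a', '~b'): .25, ('~a', 'b'): .10, ('~a', '~b'): .90},
    'C': {('a', 'c'): .50, ('a', '~c'): .50, ('~a', 'c'): .25, ('~a', '~c'): .75},
    'D': {('b', 'd'): .20, ('b', '~d'): .80, ('~b', 'd'): .70, ('~b', '~d'): .30},  # assumed values
}

# Incomplete dataset from the slides; None marks a missing value (order A, B, C, D).
data = [(None, 'b', '~c', None), (None, 'b', None, '~d'), (None, '~b', 'c', 'd'),
        (None, '~b', 'c', 'd'), (None, 'b', None, '~d')]

def joint(world, theta):
    """P_theta(world) for a complete instantiation given as a dict var -> value."""
    p = 1.0
    for x in ORDER:
        p *= theta[x][tuple(world[u] for u in PARENTS[x]) + (world[x],)]
    return p

def completions(case, theta):
    """All completions of a case with their joint probabilities, plus the
    probability of the observed part (the normalizing constant)."""
    probs = {}
    for combo in product(*(VALUES[x] for x in ORDER)):
        world = dict(zip(ORDER, combo))
        if all(obs is None or world[x] == obs for x, obs in zip(ORDER, case)):
            probs[combo] = joint(world, theta)
    return probs, sum(probs.values())

def em_step(theta):
    """One EM iteration: accumulate expected family counts, then renormalize."""
    counts = {x: {key: 0.0 for key in theta[x]} for x in ORDER}
    log_lik = 0.0
    for case in data:
        probs, z = completions(case, theta)
        log_lik += math.log(z)                       # log-likelihood of the current theta
        for combo, pr in probs.items():
            world = dict(zip(ORDER, combo))
            for x in ORDER:                          # expected count of each family instantiation
                key = tuple(world[u] for u in PARENTS[x]) + (world[x],)
                counts[x][key] += pr / z             # posterior weight of this completion
    new_theta = {x: {} for x in ORDER}
    for x in ORDER:
        for key in counts[x]:
            u = key[:-1]                             # parent instantiation
            total = sum(c for k, c in counts[x].items() if k[:-1] == u)
            new_theta[x][key] = counts[x][key] / total if total > 0 else theta[x][key]
    return new_theta, log_lik

theta, ll = em_step(theta)                           # one iteration: theta is now theta^1
print('L(theta^0; D) =', math.exp(ll))
print('theta^1_{c|~a} =', round(theta['C'][('~a', 'c')], 3))   # ~ .666, as on the slides
for _ in range(20):                                  # keep iterating toward convergence
    theta, ll = em_step(theta)
print('final log-likelihood =', ll)
```

§ As the lecture notes, the log-likelihood returned at each step never decreases, and different initial parameters may lead the algorithm to different fixed points.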
EM Algorithm: Observations
§ There are a few observations about the behaviour of the EM algorithm
§ The algorithm may converge to different parameters depending on the initial estimates 𝜃⁰
§ It is common to run the algorithm multiple times, starting with different estimates in each run
§ In this case, we return the best estimates across all runs
§ Each iteration of the EM algorithm will have to perform inference on a Bayesian network
§ In each iteration, the algorithm computes the probability of each family instantiation x𝒖 given each case 𝒅ᵢ as evidence
§ These computations correspond to posterior marginals over network families
§ Therefore, we can use an algorithm such as the jointree algorithm, which efficiently computes family marginals
Missing Data Mechanism
§ Let us consider again the network where 𝐶 represents a medical condition and 𝑇 a test for detecting this condition
§ We depict two extended network structures for this problem
§ Each includes an additional variable 𝐼 that indicates whether the test result is missing in the dataset
§ In the left network, the missing data depends on the condition
§ E.g., people who do not suffer from the condition tend not to take the test
§ In the right network, the missing data depends on the test result § E.g., individuals who test negative tend not to report the result
§ Hence, these network structures explicate different dependencies for the missing data indicator 𝐼
§ We say the structures explicate different missing data mechanisms
Missing Data Indicator
§ Our goal is to discuss ML estimates that we would obtain with respect to structures that explicate missing data mechanisms
§ And compare these estimates with those obtained when ignoring such mechanisms
§ E.g., when we use the simpler structure on the top
§ Let 𝑴 be the variables of a network 𝐺 that have missing values in the data set
§ We define 𝑰 as a set of variables called missing data indicators that are in one-to-one correspondence with variables 𝑴
§ A network structure that results from adding variables 𝑰 as leaf nodes to 𝐺 is said to explicate a missing data mechanism