Learning Parameters of Bayesian Networks with EM
Collins reading; AIMA 20.3
CMPSC 442
Week 10, Meeting 29, Three Segments
Outline
● Formalizing Naive Bayes
● EM for Learning a Naive Bayes model
● Generalizing EM
Learning Parameters of Bayesian Networks with the EM Algorithm
Collins reading; AIMA 20.3
CMPSC 442
Week 10, Meeting 29, Segment 1 of 3: Formalizing Naive Bayes
Formalizing Naive Bayes
● MLE Parameter Estimates for Multinomial Naive Bayes
○ Generalization of MLE estimates for binary Naive Bayes
● EM for Estimating NB parameters for M classes: the class priors πq and the conditional word probabilities ρ(v|q)
Expectation Maximization and Naive Bayes
● EM learns the parameters of a probabilistic model
● Meeting 28 presented EM for learning the parameters of an HMM
● Naive Bayes is one of the simplest probabilistic models
○ If the training data has class labels, the MLE parameter estimates can be
computed from the observed distribution in the data
○ If the training data lacks class labels, the Naive Bayes parameters can be
learned through EM
Assumptions for Multinomial Naive Bayes
● PQ is the set of all distributions over Q
● For each such distribution, π is a vector with components πq for each q
∈ Q corresponding to the probability of q occurring
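In symbols (a minimal formalization of the bullets above, with Q the set of classes):

$$\mathcal{P}_Q \;=\; \Big\{\, \pi \in \mathbb{R}^{|Q|} \;:\; \pi_q \ge 0 \text{ for all } q \in Q, \;\; \sum_{q \in Q} \pi_q = 1 \,\Big\}$$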
EM for Naive Bayes
● Initialization: Assign initial values to the parameters
● Expectation step: Calculate the posterior probability of the class given
the observed attribute values and the current parameters θ from
initialization or from the M-step
● Maximization step: Calculate the new parameter values based on the
class assignments from the E-step
● Continue for T iterations
Toy Example: Classes and Words
● Consider an unlabeled dataset (the class for each document is unknown)
with four random variables Vi for words in the document, conditioned
on two classes {Unknown1, Unknown2}
○ V1 = Obama
○ V2 = McCain
○ V3 = Giants
○ V4 = Patriots
Training Data
● Five documents represented as length-4 vectors di of the attribute
values for each document
○ d1 = 〈8, 5, 0, 0〉
○ d2 = 〈0, 0, 6, 7〉
○ d3 = 〈10, 4, 0, 0〉
○ d4 = 〈1, 0, 4, 4〉
○ d5 = 〈0, 0, 5, 6〉
Manually Derived Parameters
● Parameters that provide a good fit to the data (with smoothing), where a
good fit means that the parameters assign high probability to the observed data
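One plausible set of such hand-derived estimates, assuming d1 and d3 belong to one class and d2, d4, d5 to the other, with the word counts normalized within each class and smoothing ignored for readability (smoothing would replace the zeros with small positive values):

$$\begin{aligned}
\pi(\text{Unknown}_1) &= 2/5, & \pi(\text{Unknown}_2) &= 3/5\\
\rho(\text{Obama} \mid \text{Unknown}_1) &= 18/27, & \rho(\text{Obama} \mid \text{Unknown}_2) &= 1/33\\
\rho(\text{McCain} \mid \text{Unknown}_1) &= 9/27, & \rho(\text{McCain} \mid \text{Unknown}_2) &= 0/33\\
\rho(\text{Giants} \mid \text{Unknown}_1) &= 0/27, & \rho(\text{Giants} \mid \text{Unknown}_2) &= 15/33\\
\rho(\text{Patriots} \mid \text{Unknown}_1) &= 0/27, & \rho(\text{Patriots} \mid \text{Unknown}_2) &= 17/33
\end{aligned}$$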
Marginalize Out the Labels for Posterior of the Data
● Probability of each example document di is the sum over all classes q of the
joint probability of the document di and the class q
● The joint probability of di and a class q is the product of the prior probability
of the class with the product of the conditional probabilities of each attribute
value in the document
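Written out for the word-count representation above (a sketch in the π/ρ notation of this segment, where dij is the count of word vj in document di and the multinomial coefficient is omitted because it does not depend on the parameters):

$$P(d_i;\theta) \;=\; \sum_{q \in Q} P(d_i, q;\theta) \;=\; \sum_{q \in Q} \pi_q \prod_{j=1}^{n} \rho(v_j \mid q)^{d_{ij}}$$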
Log Likelihood of the Parameters θ
● The log-likelihood function L(θ) is the sum over all documents of the
logs of their probabilities (given their attribute values)
● The parameters are those that maximize the log-likelihood function,
subject to the constraints that the class prior probabilities sum to one,
and that the conditional probabilities of the words given each class sum
to one
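In the same notation as the previous slide (a sketch, assuming the word-count representation):

$$L(\theta) \;=\; \sum_{i=1}^{D} \log P(d_i;\theta) \;=\; \sum_{i=1}^{D} \log \sum_{q \in Q} \pi_q \prod_{j=1}^{n} \rho(v_j \mid q)^{d_{ij}}
\quad \text{s.t.}\;\; \sum_{q \in Q} \pi_q = 1,\;\; \sum_{j=1}^{n} \rho(v_j \mid q) = 1 \;\text{for each } q$$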
Learning Parameters of Bayesian Networks with the EM Algorithm
Collins reading; AIMA 20.3
CMPSC 442
Week 10, Meeting 29, Segment 2 of 3: EM for Naive Bayes
E-Step for Naive Bayes
● Define δ(qm|di) to be the conditional probability of the class qm given di
and the parameter values θt−1 from iteration t−1
● In each E-step, calculate the values δ(qm|di)
M-Step for Naive Bayes
● Define γt(qm) to be the new parameter value at iteration t for the prior
probability of class qm
● Define γjt(v|qm) to be the new parameter values at iteration t for the
conditional probabilities of each jth word v given the class qm
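For the word-count model used in the toy example, the M-step updates are the soft counts from the E-step, normalized (a sketch following the standard EM derivation for multinomial Naive Bayes):

$$\gamma^{t}(q_m) \;=\; \frac{1}{D} \sum_{i=1}^{D} \delta(q_m \mid d_i),
\qquad
\gamma_j^{t}(v_j \mid q_m) \;=\; \frac{\sum_{i=1}^{D} \delta(q_m \mid d_i)\, d_{ij}}{\sum_{i=1}^{D} \delta(q_m \mid d_i) \sum_{j'=1}^{n} d_{ij'}}$$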
EM Algorithm: Input and Initialization
Input
● An integer M for the number of classes
● Training examples di for i = 1 . . . D, where each di is a length-n vector of
attribute counts
● A parameter T specifying the number of iterations
Initialization: set γ0(qm) and γj0(v|qm) to some initial (e.g., random) values
satisfying the sum-to-one constraints
EM Algorithm for Naive Bayes
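A minimal sketch of the full loop in Python, assuming the multinomial word-count model from the toy example; the function and variable names are illustrative, not part of the reading:

import numpy as np

def em_naive_bayes(docs, M, T, seed=0):
    """EM for a multinomial Naive Bayes model with unlabeled documents.
    docs: (D, n) array of word counts, one row per document
    M: number of classes;  T: number of EM iterations
    Returns the class priors gamma_prior (M,) and the per-class
    word distributions gamma_word (M, n)."""
    rng = np.random.default_rng(seed)
    docs = np.asarray(docs, dtype=float)
    D, n = docs.shape

    # Initialization: random values satisfying the sum-to-one constraints
    gamma_prior = rng.dirichlet(np.ones(M))           # gamma^0(qm)
    gamma_word = rng.dirichlet(np.ones(n), size=M)    # gammaj^0(v|qm)

    for _ in range(T):
        # E-step: delta(qm|di) proportional to gamma(qm) * prod_j gamma(vj|qm)^dij,
        # computed in log space for numerical stability
        log_joint = np.log(gamma_prior)[None, :] + docs @ np.log(gamma_word).T
        log_joint -= log_joint.max(axis=1, keepdims=True)
        delta = np.exp(log_joint)
        delta /= delta.sum(axis=1, keepdims=True)     # posteriors, shape (D, M)

        # M-step: re-estimate the parameters from the soft class assignments
        gamma_prior = delta.mean(axis=0)              # gamma^t(qm)
        weighted = delta.T @ docs                     # soft word counts, (M, n)
        gamma_word = weighted / weighted.sum(axis=1, keepdims=True)

    return gamma_prior, gamma_word

# Toy corpus from the slides: counts of (Obama, McCain, Giants, Patriots)
docs = [[8, 5, 0, 0], [0, 0, 6, 7], [10, 4, 0, 0], [1, 0, 4, 4], [0, 0, 5, 6]]
priors, word_probs = em_naive_bayes(docs, M=2, T=20)
print(priors.round(3))
print(word_probs.round(3))

With T = 20 iterations the two recovered clusters typically correspond to the political documents (d1, d3) and the sports documents (d2, d4, d5), up to which cluster receives which label.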
Naive Bayes E-Step in Detail
● Where there are D examples, M classes, and n words, calculate the
posterior probability of the class for each example using the parameters
θt−1, consisting of γt−1(qm) and γjt−1(v|qm)
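Concretely, in the notation of the E-step and M-step slides (a sketch for the word-count model):

$$\delta(q_m \mid d_i) \;=\;
\frac{\gamma^{t-1}(q_m)\,\prod_{j=1}^{n} \gamma_j^{t-1}(v_j \mid q_m)^{d_{ij}}}
{\sum_{m'=1}^{M} \gamma^{t-1}(q_{m'})\,\prod_{j=1}^{n} \gamma_j^{t-1}(v_j \mid q_{m'})^{d_{ij}}}$$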
Convergence Properties of EM
● Theorem: the log-likelihood of the parameters is non-decreasing from one
EM iteration to the next, i.e., L(θt) ≥ L(θt−1)
● In the limit as T goes to ∞, EM converges to a local optimum of the
log-likelihood
Learning Parameters of Bayesian Networks with the EM Algorithm
Collins reading; AIMA 20.3
CMPSC 442
Week 10, Meeting 29, Segment 3 of 3: Generalizing EM
Generative Models Support Parameter Estimation
● A joint probability distribution supports queries about any one or more
random variables by marginalizing over the other variables
● Therefore the same model can be used in multiple directions:
○ To estimate the posterior probability of labels given the data
○ To estimate the posterior probability of the data given the labels
Models with Latent Variables are More Compact
● Behavior-Disease-Symptom network with three behaviors, a disease, and
three symptoms has 2×3 + 54 + 3×6 = 78 parameters
● Behavior-Symptom network with the same three behaviors and three
symptoms but no latent variable has 2×3 + 54 + 162 + 486 = 708 parameters
● Each observed variable has 3 values (none, moderate, severe), and is shown
with how many parameters it has
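One way to reproduce these counts, assuming every variable has 3 values (so 2 free parameters per parent configuration) and that, without the latent Disease node, each symptom must be conditioned on all three behaviors and on the symptoms before it:

$$\begin{aligned}
\text{with the latent node:}\quad & 3 \times 2 \;+\; 3^3 \times 2 \;+\; 3 \times (3 \times 2) \;=\; 6 + 54 + 18 \;=\; 78\\
\text{without it:}\quad & 3 \times 2 \;+\; 3^3 \times 2 \;+\; 3^4 \times 2 \;+\; 3^5 \times 2 \;=\; 6 + 54 + 162 + 486 \;=\; 708
\end{aligned}$$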
Latent Variable Models
● The validity of the hidden variable (e.g., part-of-speech tag; disease)
depends on empirical evidence
○ Explanatory power of the hidden variable for multiple questions
○ Ability of trained judges to agree on its presence or absence
● Given that it can be difficult to get (enough) labeled data, EM can be
used to estimate the model parameters
Unsupervised Clustering of Continuous Data
● Gaussian distribution
○ Many things in nature, e.g., attributes of Iris species: sepal length, sepal
width, petal length, petal width
○ Iris-Setosa dataset
● Mixture of Gaussians
○ Three species of iris
○ A Gaussian Mixture Model can identify three distinct clusters in an
unsupervised manner using EM
(Figure: Iris Setosa dataset, Scikit Learn)
Gaussian Mixture Model
● Parameters of a mixture of Gaussians are
○ The weight of each component
○ The mean of each component
○ The covariance of each component
(Figure: a Gaussian mixture model; 500 data points sampled from the model; the
model reconstructed from the data by EM. Norvig & Russell, 3rd Ed., Fig 20.12)
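A minimal sketch, assuming scikit-learn is available, of fitting a 3-component Gaussian mixture to the Iris measurements with EM and reading off the learned weights, means, and covariances:

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data                       # 150 x 4: sepal/petal length and width
gmm = GaussianMixture(n_components=3,      # one component per (unknown) species
                      covariance_type="full",
                      random_state=0).fit(X)

print(gmm.weights_)         # mixing weight of each component
print(gmm.means_)           # mean vector of each component
print(gmm.covariances_)     # covariance matrix of each component
print(gmm.predict(X)[:10])  # hard cluster assignments for the first 10 flowers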
General Form of EM
● Where x is all the observed values in all the examples, Z is all the hidden
variables for all the examples, and θ is all the parameters
○ E-step is the summation over P(Z=z | x, θ(k)), which is the posterior of the hidden
variables given the data
○ M-step is the maximization over θ of the expected log likelihood L(x, Z=z | θ)
● For mixtures of Gaussians, the hidden variables are the Zijs, where Zij=1 if example j was
generated by component i
● For Bayes nets, Zij is the value of unobserved variable Xi in example j
● For HMMs, Zjt is the state of the sequence in example j at time t
● Many improvements exist, and many other applications of EM are possible
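Putting the two steps together in a single update (a sketch of the general form, with L the log likelihood):

$$\theta^{(k+1)} \;=\; \arg\max_{\theta}\; \sum_{z} P(Z = z \mid \mathbf{x}, \theta^{(k)})\; L(\mathbf{x}, Z = z \mid \theta)$$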
Summary
● The formal specification of multinomial Naive Bayes identifies two sets
of random variables (classes and attributes), and the parameters for
each
● For unlabeled data where the examples are represented as attribute-value
pairs, EM can be used to find the NB parameters
● EM is not guaranteed to find a global maximum, but finds a local
maximum, and often performs very well
● A very general form of EM can be given that encompasses multiple
applications: HMMs, Naive Bayes, continuous data, and Bayes’ Nets