
Learning Parameters of Bayesian
Networks with EM
Collins reading; AIMA 20.3

CMPSC 442
Week 10, Meeting 29, Three Segments

Outline

● Formalizing Naive Bayes
● EM for Learning a Naive Bayes model
● Generalizing EM


Learning Parameters of Bayesian
Networks with the EM Algorithm
Collins reading; AIMA 20.3

CMPSC 442
Week 10, Meeting 29, Segment 1 of 3: Formalizing Naive
Bayes

Formalizing Naive Bayes

● MLE Parameter Estimates for Multinomial Naive Bayes
○ Generalization of MLE estimates for binary Naive Bayes

● EM for Estimating NB parameters π_q and ρ_n for M classes


Expectation Maximization and Naive Bayes

● EM learns the parameters of a probabilistic model
● Meeting 28 presented EM for learning the parameters of an HMM
● Naive Bayes is one of the simplest probabilistic models

○ If the training data has class labels, the MLE parameter estimates can be
computed from the observed distribution in the data

○ If the training data lacks class labels, the Naive Bayes parameters can be
learned through EM
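To make the labeled-data case in the first sub-bullet concrete, here is a minimal sketch of the MLE estimates under the multinomial model, writing count(q) for the number of documents labeled q and count_q(j) for the total count of word j in those documents (this notation is mine, not the slides'):

    \hat{\pi}_q = \frac{\mathrm{count}(q)}{D}, \qquad
    \hat{\rho}_j(v_j \mid q) = \frac{\mathrm{count}_q(j)}{\sum_{j'=1}^{n} \mathrm{count}_q(j')}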


Assumptions for Multinomial Naive Bayes

● P_Q is the set of all distributions over the set of classes Q
● For each such distribution, π is a vector with a component π_q for each
q ∈ Q, corresponding to the probability of q occurring
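Written out, the first assumption says (the simplex notation below is my phrasing rather than the slide's):

    P_Q = \{ \pi \in \mathbb{R}^{|Q|} : \pi_q \ge 0 \ \text{for all } q \in Q, \ \sum_{q \in Q} \pi_q = 1 \}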


EM for Naive Bayes

● Initialization: Assign initial values to the parameters
● Expectation step: Calculate the posterior probability of the class given

the observed attribute values and the current parameters θ from
initialization or from the M-step

● Maximization step: Calculate the new parameter values based on the
class assignments from the E-step

● Continue for T iterations


Toy Example: Classes and Words

● Consider an unlabeled dataset (class for each document is unknown)
with four random variables Vi for words in the document, conditioned
on two classes {Unknown_1, Unknown_2}

○ V1 = Obama
○ V2 = McCain
○ V3 = Giants
○ V4 = Patriots


Training Data

● Five documents represented as length-4 vectors di of the attribute
values for each document
○ d1 = 〈8, 5, 0, 0 〉
○ d2 = 〈0, 0, 6, 7 〉
○ d3 = 〈10, 4, 0, 0 〉
○ d4 = 〈1, 0, 4, 4 〉
○ d5 = 〈0, 0, 5, 6 〉
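For use in the EM sketch later in these notes, the same five documents can be written as a small count matrix; a minimal Python version (the variable name docs is my own):

    import numpy as np

    # Rows are documents d1..d5; columns are counts of Obama, McCain, Giants, Patriots
    docs = np.array([
        [ 8, 5, 0, 0],   # d1
        [ 0, 0, 6, 7],   # d2
        [10, 4, 0, 0],   # d3
        [ 1, 0, 4, 4],   # d4
        [ 0, 0, 5, 6],   # d5
    ])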


Manually Derived Parameters

● Parameters that provide a good fit to the data (with smoothing), where a
good fit means that the parameters assign high probability to the observed data
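The parameter table from the original slide is not reproduced in this text version; the following is a plausible reconstruction from the raw counts, grouping d1, d3 into one class (q1) and d2, d4, d5 into the other (q2), and ignoring the exact smoothing used on the slide:

    π(q1) ≈ 2/5,  π(q2) ≈ 3/5
    ρ(Obama | q1) = 18/27 ≈ 0.67,  ρ(McCain | q1) = 9/27 ≈ 0.33,  ρ(Giants | q1) ≈ 0,  ρ(Patriots | q1) ≈ 0
    ρ(Obama | q2) = 1/33 ≈ 0.03,  ρ(McCain | q2) ≈ 0,  ρ(Giants | q2) = 15/33 ≈ 0.45,  ρ(Patriots | q2) = 17/33 ≈ 0.52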


Marginalize Out the Labels to Get the Probability of the Data

● Probability of each example document di is the sum over all classes of the
joint probability of the document di and the class Q

● The joint probability of document di and a class is the product of the prior
probability of the class with the conditional probabilities of each attribute
value in the document; summing these joint probabilities over the classes
gives the probability of di
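One way to write this for count-valued documents, with x_ij the count of word j in d_i (my notation; the multinomial coefficient is omitted since it does not affect the parameter estimates):

    P(d_i) = \sum_{m=1}^{M} P(d_i, q_m) = \sum_{m=1}^{M} \pi_{q_m} \prod_{j=1}^{n} \rho_j(v_j \mid q_m)^{x_{ij}}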


Log Likelihood of the Parameters θ

● The log-likelihood function L(θ) is the sum over all documents of the
logs of their probabilities (given their attribute values)

● The parameters are those that maximize the log-likelihood function,
subject to the constraints that the class prior probabilities sum to one
and that, for each class, the conditional probabilities of the words sum
to one
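Using the same notation as above, a sketch of the objective and its constraints (smoothing omitted):

    L(\theta) = \sum_{i=1}^{D} \log P(d_i \mid \theta)
              = \sum_{i=1}^{D} \log \sum_{m=1}^{M} \pi_{q_m} \prod_{j=1}^{n} \rho_j(v_j \mid q_m)^{x_{ij}}

    \text{subject to } \sum_{m=1}^{M} \pi_{q_m} = 1 \quad \text{and} \quad \sum_{j=1}^{n} \rho_j(v_j \mid q_m) = 1 \ \text{for each class } q_m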


Learning Parameters of Bayesian
Networks with the EM Algorithm
Collins reading; AIMA 20.3

CMPSC 442
Week 10, Meeting 29, Segment 2 of 3: EM for Naive Bayes

E-Step for Naive Bayes

● Define δ(q_m | d_i) to be the conditional probability of the class q_m given d_i
and the parameter values θ^(t−1) from iteration t−1

● In each E-step, calculate the values δ(q_m | d_i)
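A plausible reconstruction of the E-step formula for count-valued documents, with x_ij the count of word j in d_i and γ the current parameter estimates (the exact form on the original slide may differ):

    \delta(q_m \mid d_i) =
      \frac{\gamma^{t-1}(q_m) \prod_{j=1}^{n} \gamma_j^{t-1}(v_j \mid q_m)^{x_{ij}}}
           {\sum_{m'=1}^{M} \gamma^{t-1}(q_{m'}) \prod_{j=1}^{n} \gamma_j^{t-1}(v_j \mid q_{m'})^{x_{ij}}}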


M-Step for Naive Bayes

● Define γ^t(q_m) to be the new parameter value at iteration t for the prior
probability of class q_m

● Define γ_j^t(v | q_m) to be the new parameter value at iteration t for the
conditional probability of the jth word v given the class q_m
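One consistent way to write the updates for count-valued documents, reusing x_ij from above (a sketch; smoothing omitted):

    \gamma^{t}(q_m) = \frac{1}{D} \sum_{i=1}^{D} \delta(q_m \mid d_i)

    \gamma_j^{t}(v_j \mid q_m) =
      \frac{\sum_{i=1}^{D} \delta(q_m \mid d_i)\, x_{ij}}
           {\sum_{i=1}^{D} \delta(q_m \mid d_i) \sum_{j'=1}^{n} x_{ij'}}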


EM Algorithm: Input and Initialization

Input
● An integer M for the number of classes
● Training examples di for i = 1 . . . D, where each di is a vector of attribute
(word count) values
● A parameter T specifying the number of iterations

Initialization: set γ^0(q_m) and γ_j^0(v | q_m) to some initial (e.g., random) values
satisfying the normalization constraints


EM Algorithm for Naive Bayes
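The algorithm box from the original slide is not reproduced in this text version. As a stand-in, here is a minimal Python sketch of EM for multinomial Naive Bayes assembled from the E-step and M-step definitions above; the function name, the log-space computation, and the small smoothing constant are my own choices:

    import numpy as np

    def em_naive_bayes(docs, num_classes, num_iters, smooth=1e-6, seed=0):
        """EM for a multinomial Naive Bayes model over unlabeled count vectors.

        docs: (D, n) array of word counts. Returns (priors, word_probs),
        i.e. the final gamma(q_m) and gamma_j(v | q_m)."""
        rng = np.random.default_rng(seed)
        D, n = docs.shape

        # Initialization: random values satisfying the normalization constraints
        priors = rng.random(num_classes)
        priors /= priors.sum()
        word_probs = rng.random((num_classes, n))
        word_probs /= word_probs.sum(axis=1, keepdims=True)

        for _ in range(num_iters):
            # E-step: delta[i, m] = P(q_m | d_i, theta^(t-1)), computed in log space
            log_joint = np.log(priors) + docs @ np.log(word_probs).T   # shape (D, M)
            log_joint -= log_joint.max(axis=1, keepdims=True)          # numerical stability
            delta = np.exp(log_joint)
            delta /= delta.sum(axis=1, keepdims=True)

            # M-step: re-estimate the parameters from the soft class assignments
            priors = delta.sum(axis=0) / D                     # gamma^t(q_m)
            expected = delta.T @ docs + smooth                 # expected word counts per class
            word_probs = expected / expected.sum(axis=1, keepdims=True)

        return priors, word_probs

    # Usage on the toy corpus above: two unknown classes, 50 iterations
    docs = np.array([[8, 5, 0, 0], [0, 0, 6, 7], [10, 4, 0, 0],
                     [1, 0, 4, 4], [0, 0, 5, 6]])
    priors, word_probs = em_naive_bayes(docs, num_classes=2, num_iters=50)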


Naive Bayes E-Step in Detail

● Where there are D examples, M classes, and n words, calculate the
posterior of each class for each example using the parameters θ^(t−1),
consisting of γ^(t−1)(q_m) and γ_j^(t−1)(v | q_m)


Convergence Properties of EM

● Theorem: the log-likelihood of the parameters is non-decreasing across
EM iterations, i.e., L(θ^t) ≥ L(θ^(t−1)) for every t

● In the limit as T goes to ∞, EM converges to a local optimum


Learning Parameters of Bayesian
Networks with the EM Algorithm
Collins reading; AIMA 20.3

CMPSC 442
Week 10, Meeting 29, Segment 3 of 3: Generalizing EM

Generative Models Support Parameter Estimation

● A joint probability distribution supports queries about any one or more
random variables by marginalizing over the other variables

● Therefore the same model can be used in multiple directions:
○ To estimate the posterior probability of labels given the data
○ To estimate the probability of the data given the labels


Models with Latent Variables are More Compact

● Behavior-Disease-Symptom network with three behaviors, a disease, and
three symptoms has 2×3 + 54 + 3×6 parameters = 78

● Behavior-Symptom network with the same three behaviors and three
symptoms but no latent variable has 2×3 + 54 + 162 + 486 parameters = 708


● Each observed variable has 3 values (none, moderate, severe), and is shown
with how many parameters it has
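A quick check of these counts, assuming (as in the corresponding AIMA figure) that every variable, including the hidden disease, has 3 values, so each conditional distribution needs 2 independent parameters per parent configuration, and that without the latent variable the symptoms depend on the behaviors and on each other:

    With the latent disease:     3 behaviors × 2  +  2 × 3^3 (disease)  +  3 symptoms × (2 × 3)  =  6 + 54 + 18  =  78
    Without the latent variable: 3 behaviors × 2  +  2 × 3^3  +  2 × 3^4  +  2 × 3^5  =  6 + 54 + 162 + 486  =  708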

Latent Variable Models

● The validity of the hidden variable (e.g., part-of-speech tag; disease)
depends on empirical evidence
○ Explanatory power of the hidden variable for multiple questions
○ Ability of trained judges to agree on its presence or absence

● Given that it can be difficult to get (enough) labeled data, EM can be
used to estimate the model parameters


Unsupervised Clustering of Continuous Data

● Gaussian distribution
○ Many things in nature, e.g., attributes of Iris species: sepal length,
sepal width, petal length, petal width
○ Iris dataset (Scikit Learn)

● Mixture of Gaussians
○ Three species of iris
○ A Gaussian Mixture Model can identify three distinct clusters in an
unsupervised manner using EM

[Figure: scatter plot of the Iris dataset, Scikit Learn]
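A minimal sketch of this clustering with scikit-learn's GaussianMixture, which fits the mixture by EM (the choices of covariance_type and random_state here are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.mixture import GaussianMixture

    X = load_iris().data                       # 150 samples x 4 attributes
    gmm = GaussianMixture(n_components=3,      # one component per hypothesized species
                          covariance_type="full",
                          random_state=0).fit(X)

    clusters = gmm.predict(X)                  # unsupervised cluster assignments
    # The fitted parameters match the list on the next slide:
    # gmm.weights_ (component weights), gmm.means_, gmm.covariances_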


Gaussian Mixture Model

● Parameters of a mixture of Gaussians are
○ The weight of each component
○ The mean of each component
○ The covariance of each component

[Figure: a Gaussian mixture model; 500 data points sampled from the model; and the
model reconstructed from the data by EM (Norvig & Russell, 3rd Ed., Fig 20.12)]

General Form of EM

● Where x is all the observed values in all the examples, Z is all the hidden
variables for all the examples, and θ is all the parameters
○ The E-step is the summation over P(Z = z | x, θ^(k)), which is the posterior
of the hidden variables given the data
○ The M-step is the maximization of the expected log likelihood L(x, Z = z | θ)

● For mixtures of Gaussians, the hidden variables are the Z_ij, where Z_ij = 1
if example j was generated by component i
● For Bayes nets, Z_ij is the value of unobserved variable X_i in example j
● For HMMs, Z_jt is the state of the sequence in example j at time t
● Many improvements exist, and many other applications of EM are possible
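Putting the two steps together in the general form used in AIMA (a sketch; L denotes the log likelihood and θ^(k) the parameters at iteration k):

    \theta^{(k+1)} = \operatorname{argmax}_{\theta} \sum_{z} P(Z = z \mid x, \theta^{(k)}) \, L(x, Z = z \mid \theta)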


Summary

● The formal specification of multinomial Naive Bayes identifies two sets
of random variables (classes and attributes), and the parameters for
each

● For unlabeled data where the examples are represented as attribute-value
pairs, EM can be used to find the NB parameters

● EM is not guaranteed to find a global maximum, but finds a local
maximum, and often performs very well

● A very general form of EM can be given that encompasses multiple
applications: HMMs, Naive Bayes, continuous data, and Bayes’ Nets
