Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 4, 2015
Today:
• Graphical models
• Bayes Nets:
• EM
• Mixture of Gaussians clustering
• Learning Bayes Net structure (Chow-Liu)
Readings:
• Bishop chapter 8
• Mitchell chapter 6
Learning of Bayes Nets
• Four categories of learning problems
– Graph structure may be known/unknown
– Variable values may be fully observed / partly unobserved
• Easy case: learn parameters when graph structure is known and data is fully observed
• Interesting case: graph known, data partly observed
• Gruesome case: graph structure unknown, data partly unobserved
EM Algorithm – Informally
EM is a general procedure for learning from partly observed data.
Given observed variables X, unobserved Z (X = {F, A, H, N}, Z = {S}):
Begin with arbitrary choice for parameters θ
Iterate until convergence:
• E Step: estimate the values of unobserved Z, using θ
• M Step: use observed values plus E-step estimates to derive a better θ
Guaranteed to find a local maximum. Each iteration increases the likelihood of the observed data, P(X|θ).
EM Algorithm – Precisely
EM is a general procedure for learning from partly observed data.
Given observed variables X, unobserved Z (X = {F, A, H, N}, Z = {S}), define the expected log-likelihood of the complete data, Q(θ'|θ) (written out below).
Iterate until convergence:
• E Step: Use X and current θ to calculate P(Z|X,θ)
• M Step: Replace current θ by the θ' that maximizes Q(θ'|θ)
Guaranteed to find a local maximum. Each iteration increases the likelihood of the observed data, P(X|θ).
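In standard notation (the slide's displayed formula is assumed to take this usual form), the quantity being defined and maximized is

Q(\theta' \mid \theta) = E_{Z \sim P(Z \mid X, \theta)} \left[ \log P(X, Z \mid \theta') \right]

E step: compute P(Z | X, θ), which determines Q(θ'|θ)
M step: \theta \leftarrow \arg\max_{\theta'} Q(\theta' \mid \theta)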
E Step: Use X, θ, to Calculate P(Z|X,θ)
observed X = {F, A, H, N}, unobserved Z = {S}
[Figure: Bayes net with Flu and Allergy as parents of Sinus; Sinus as parent of Headache and Nose]
• How? Bayes net inference problem.
let’s use p(a,b) as shorthand for p(A=a, B=b)
EM and estimating
observed X = {F,A,H,N}, unobserved Z={S}
[Figure: Bayes net: Flu, Allergy → Sinus → Headache, Nose]
E step: Calculate P(Zk|Xk; θ) for each training example, k
M step: update all relevant parameters. For example, consider θ_{S|F,A} = P(S = 1 | F, A).
Recall the MLE, and the EM update that replaces counts with expected counts (both written out below):
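Writing θ_{s|ij} for P(S = 1 | F = i, A = j) (notation assumed; these are the standard forms of both estimates):

MLE, if S were observed:
\theta_{s|ij} = \frac{\#D\{S=1, F=i, A=j\}}{\#D\{F=i, A=j\}}

EM M step (S unobserved; counts replaced by expected counts from the E step):
\theta_{s|ij} = \frac{\sum_k P(S_k = 1 \mid f_k, a_k, h_k, n_k; \theta)\; \delta(f_k = i, a_k = j)}{\sum_k \delta(f_k = i, a_k = j)}

where δ(·) is 1 when its condition holds in training example k, and 0 otherwise.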
EM and estimating
[Figure: same Bayes net: Flu, Allergy → Sinus → Headache, Nose]
More generally: given observed set X, unobserved set Z of boolean values,
E step: calculate, for each training example k, the expected value of each unobserved variable
M step: calculate estimates similar to the MLEs, but replacing each count by its expected count (see below)
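In symbols, the expected count used in place of an observed count is (standard form, notation assumed):

E_{P(Z \mid X, \theta)}\big[\#D\{\text{condition}\}\big] = \sum_{k=1}^{K} P(\text{condition holds in example } k \mid x_k; \theta)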
Using Unlabeled Data to Help Train Naïve Bayes Classifier
Learn P(Y|X)
[Figure: naïve Bayes network Y → X1, X2, X3, X4]

Y   X1   X2   X3   X4
1   0    0    1    1
0   0    1    0    0
0   0    0    1    0
?   0    1    1    0
?   0    1    0    1
EM and estimating
Given observed set X, unobserved set Y of boolean values
E step: calculate, for each training example k, the expected value of the unobserved variable Y
M step: calculate estimates similar to MLE, but replacing each count by its expected count
The MLE would be (with the EM expected-count version shown in the sketch below):
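For boolean Xi and Y, the MLE has the standard count form, and EM replaces the counts by expected counts (notation assumed):

\hat{P}(Y=1) = \frac{\#D\{Y=1\}}{|D|}, \qquad
\hat{P}(X_i = 1 \mid Y = y) = \frac{\#D\{X_i = 1, Y = y\}}{\#D\{Y = y\}}

EM version: replace \#D\{X_i=1, Y=y\} by \sum_k P(Y_k = y \mid x_k; \theta)\, x_{ik}, and \#D\{Y=y\} by \sum_k P(Y_k = y \mid x_k; \theta).

A minimal sketch of the whole loop for this semi-supervised naïve Bayes setting; function and variable names are mine, not from the lecture, the smoothing pseudo-count is an added assumption, and the toy data mirrors the table above:

import numpy as np

# Toy data in the spirit of the table above: columns X1..X4, labels Y (-1 = unlabeled).
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
y = np.array([1, 0, 0, -1, -1])

def fit_nb_em(X, y, n_iters=20, alpha=1.0):
    """EM for Bernoulli naive Bayes with some labels missing (a sketch).
    alpha is a smoothing pseudo-count (an added assumption, not from the slides)."""
    n, d = X.shape
    # Responsibilities r[k, c] = P(Y=c | x_k): labeled examples are fixed to their
    # labels, unlabeled ones start at 0.5 / 0.5.
    r = np.full((n, 2), 0.5)
    labeled = y >= 0
    r[labeled] = 0.0
    r[labeled, y[labeled]] = 1.0
    for _ in range(n_iters):
        # M step: MLE-style estimates, with counts replaced by expected counts.
        class_count = r.sum(axis=0)                       # expected #{Y=c}
        prior = (class_count + alpha) / (n + 2 * alpha)   # P(Y=c)
        theta = (r.T @ X + alpha) / (class_count[:, None] + 2 * alpha)  # P(X_i=1 | Y=c)
        # E step: recompute P(Y=c | x_k) for the unlabeled examples only.
        log_lik = (X @ np.log(theta).T
                   + (1 - X) @ np.log(1 - theta).T
                   + np.log(prior))                        # shape (n, 2)
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        r[~labeled] = post[~labeled]
    return prior, theta, r

prior, theta, r = fit_nb_em(X, y)
print("P(Y=1):", round(prior[1], 3))
print("P(Y=1 | x) for the unlabeled rows:", r[~(y >= 0), 1])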
Experimental Evaluation
• Newsgroup postings
– 20 newsgroups, 1000/group
• Web page classification
– student, faculty, course, project
– 4199 web pages
• Reuters newswire articles
– 12,902 articles
– 90 topic categories
From [Nigam et al., 2000]
20 Newsgroups
word w ranked by P(w|Y=course) / P(w|Y ≠ course)
Using one labeled example per class
20 Newsgroups
Unsupervised clustering
Just an extreme case of EM, with zero labeled examples…
Clustering
• Given a set of data points, group them
• Unsupervised learning
• Which patients are similar? (or which earthquakes, customers, faces, web pages, …)
Mixture Distributions
Model joint as mixture of multiple distributions.
Use a discrete-valued random variable Z to indicate which distribution is used for each random draw.
So the distribution over X is a weighted combination of the component distributions:
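Written out (the standard mixture form):

P(X = x) = \sum_{i} P(Z = i)\, P(X = x \mid Z = i)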
Mixture of Gaussians:
• Assume each data point X = <X1, …, Xn> is generated by:
1. randomly choosing Gaussian i, according to P(Z=i)
2. randomly generating the data point from that Gaussian, N(μi, Σi)
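A minimal sketch of this two-step generative process, assuming two 2-dimensional Gaussians with made-up parameters (none of these numbers come from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters.
mixing = np.array([0.3, 0.7])                    # P(Z = i)
means = np.array([[0.0, 0.0], [3.0, 3.0]])       # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2)])    # Sigma_i

def sample_mixture(n):
    """Draw n points: first pick the Gaussian, then sample from it."""
    z = rng.choice(len(mixing), size=n, p=mixing)             # step 1
    x = np.array([rng.multivariate_normal(means[i], covs[i])  # step 2
                  for i in z])
    return x, z

X, Z = sample_mixture(500)
print(X.shape, np.bincount(Z))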
Mixture of Gaussians
EM for Mixture of Gaussian Clustering
Let’s simplify to make this easier:
1. assume X = <X1, …, Xn>, and the Xi are conditionally independent given Z
2. assume only 2 clusters (values of Z), and
3. assume σ known; π1 … πK, μ1i … μKi unknown
[Figure: network Z → X1, X2, X3, X4]
Observed: X = <X1, …, Xn>; Unobserved: Z
EM
Given observed variables X, unobserved Z. Define Q(θ'|θ) as in the general EM setup earlier.
[Figure: network Z → X1, X2, X3, X4]
Iterate until convergence:
• E Step: Calculate P(Z(n)|X(n),θ) for each example X(n). Use this to construct Q(θ'|θ)
• M Step: Replace current θ by the θ' that maximizes Q(θ'|θ)
EM – E Step
[Figure: network Z → X1, X2, X3, X4]
Calculate P(Z(n)|X(n),θ) for each observed example X(n) = <x1(n), …, x4(n)> (closed form below).
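For the simplified model (two clusters, shared known σ, conditionally independent Xi), this posterior takes the standard closed form:

P(z^{(n)} = 1 \mid x^{(n)}, \theta) =
\frac{\pi_1 \prod_i \exp\!\big(-(x_i^{(n)} - \mu_{1i})^2 / 2\sigma^2\big)}
     {\sum_{j=1}^{2} \pi_j \prod_i \exp\!\big(-(x_i^{(n)} - \mu_{ji})^2 / 2\sigma^2\big)}

(the Gaussian normalizing constants cancel because σ is shared).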
First consider update for π
EM – M Step
[Figure: network Z → X1, X2, X3, X4. Derivation sketch: terms of Q on which π′ has no influence drop out; E[z(n)] denotes the probability that z = 1 for the nth example]
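The resulting update for π (standard form, with E[z(n)] = P(z(n) = 1 | x(n), θ) from the E step):

\pi' = \frac{1}{N} \sum_{n=1}^{N} E\big[z^{(n)}\big]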
Now consider update for μji
EM – M Step
[Figure: network Z → X1, X2, X3, X4. Derivation sketch: terms of Q on which μji′ has no influence drop out]
Compare this update to the MLE we would use if Z were observable (both written out below):
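The two formulas, in standard form:

EM update:
\mu'_{ji} = \frac{\sum_n P(z^{(n)} = j \mid x^{(n)}, \theta)\; x_i^{(n)}}{\sum_n P(z^{(n)} = j \mid x^{(n)}, \theta)}

MLE if Z were observed:
\hat{\mu}_{ji} = \frac{\sum_n \delta(z^{(n)} = j)\; x_i^{(n)}}{\sum_n \delta(z^{(n)} = j)}

That is, EM replaces the hard indicator δ(z(n) = j) by the soft posterior probability from the E step.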
EM – putting it together
[Figure: network Z → X1, X2, X3, X4]
Given observed variables X, unobserved Z. Define Q(θ'|θ) as before.
Iterate until convergence:
• E Step: For each observed example X(n), calculate P(Z(n)|X(n),θ)
• M Step: Update π and each μji using these posteriors, as in the two updates above (a code sketch follows below)
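A minimal, self-contained sketch of this EM loop for the simplified model above (2 clusters, known shared σ, independent dimensions). Function and variable names are mine, not from the slides, and the synthetic data at the end is made up for illustration:

import numpy as np

def em_gaussian_mixture(X, sigma=1.0, n_iters=50, seed=0):
    """EM for a 2-cluster mixture of axis-aligned Gaussians with known, shared sigma.
    Returns pi = P(Z=1), mu (2 x d cluster means), and the responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = 0.5
    mu = X[rng.choice(n, size=2, replace=False)]   # initialize means at two data points
    for _ in range(n_iters):
        # E step: responsibility r[n] = P(z(n)=1 | x(n), theta).
        # With a shared sigma the normalizing constants cancel.
        log_unnorm = np.stack([
            np.log(1 - pi) - ((X - mu[0]) ** 2).sum(axis=1) / (2 * sigma ** 2),
            np.log(pi)     - ((X - mu[1]) ** 2).sum(axis=1) / (2 * sigma ** 2),
        ], axis=1)
        log_unnorm -= log_unnorm.max(axis=1, keepdims=True)
        post = np.exp(log_unnorm)
        post /= post.sum(axis=1, keepdims=True)
        r = post[:, 1]
        # M step: pi' = average responsibility; mu' = responsibility-weighted means.
        pi = r.mean()
        mu = np.stack([
            ((1 - r)[:, None] * X).sum(axis=0) / (1 - r).sum(),
            (r[:, None] * X).sum(axis=0) / r.sum(),
        ])
    return pi, mu, r

# Example: recover two clusters from synthetic data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(3.0, 1.0, size=(300, 2))])
pi, mu, r = em_gaussian_mixture(X)
print("pi:", round(pi, 2))
print("cluster means:", mu.round(2))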
Mixture of Gaussians applet
Go to: http://www.socr.ucla.edu/htmls/SOCR_Charts.html
then go to “Line Charts” → SOCR EM Mixture Chart
• try it with 2 Gaussian mixture components (“kernels”)
• try it with 4
What you should know about EM
• For learning from partly unobserved data
• MLE of θ = arg max_θ log P(X, Z | θ)
• EM estimate: θ = arg max_θ E_{Z|X,θ}[ log P(X, Z | θ) ]
where X is observed part of data, Z is unobserved
• Nice case is Bayes net of boolean vars:
– M step is like MLE, with unobserved values replaced by
their expected values, given the other observed values
• EM for training Bayes networks
• Can also develop MAP version of EM
• Can also derive your own EM algorithm for your own problem
– write out the expression for the complete-data log likelihood, log P(X, Z | θ)
– E step: for each training example Xk, calculate P(Zk | Xk, θ)
– M step: choose new θ to maximize the expected value of that log likelihood under the E-step distribution
Learning Bayes Net Structure
How can we learn Bayes Net graph structure?
In general case, open problem
• can require lots of data (else high risk of overfitting)
• can use Bayesian methods to constrain search
One key result:
• Chow-Liu algorithm: finds “best” tree-structured network
• What’s best?
– suppose P(X) is the true distribution, T(X) is our tree-structured network, where X = <X1, …, Xn>
– Chow-Liu minimizes the Kullback-Leibler divergence KL(P || T):
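Using the standard definition of the KL divergence:

KL(P \,\|\, T) = \sum_{x} P(X = x) \log \frac{P(X = x)}{T(X = x)}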
Chow-Liu Algorithm
Key result: To minimize KL(P || T), it suffices to find the tree network T that maximizes the sum of mutual informations over its edges
Mutual information for an edge between variables A and B:
I(A, B) = Σ_{a,b} P(A=a, B=b) log [ P(A=a, B=b) / ( P(A=a) P(B=b) ) ]
This works because, for tree-structured networks over nodes X1 … Xn, KL(P || T) decomposes into a sum of per-edge mutual informations plus terms that do not depend on the choice of tree (see below).
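The identity behind this (a standard result; entropies H are taken under P, Pa(i) denotes Xi’s parent in the tree, and T uses the true conditionals P(Xi | X_{Pa(i)})):

KL(P \,\|\, T) = -\sum_{i} I\big(X_i, X_{Pa(i)}\big) + \sum_{i} H(X_i) - H(X_1, \ldots, X_n)

The entropy terms do not depend on which tree is chosen, so minimizing KL(P || T) is the same as maximizing the sum of edge mutual informations.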
Chow-Liu Algorithm
1. for each pair of vars A, B, use data to estimate P(A,B), P(A), P(B)
2. for each pair of vars A, B, calculate the mutual information I(A,B)
3. calculate the maximum spanning tree over the set of variables, using edge weights I(A,B)
(given N vars, this costs only O(N²) time)
4. add arrows to edges to form a directed acyclic graph
5. learn the CPDs for this graph
(a code sketch of steps 1–4 follows below)
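A minimal sketch of steps 1–4, assuming boolean data in a NumPy array; the helper names and the union-find / DFS details are mine, not from the lecture (step 5 would then fit the CPDs by ordinary counting):

import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical I(X;Y) in nats for two discrete (e.g. boolean) columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Steps 1-4 of Chow-Liu: estimate pairwise MI, build a maximum spanning tree,
    then orient edges away from an (arbitrary) root. data has shape (n_examples, n_vars)."""
    n_vars = data.shape[1]
    # Steps 1-2: mutual information for every pair of variables.
    weights = {(i, j): mutual_information(data[:, i], data[:, j])
               for i, j in combinations(range(n_vars), 2)}
    # Step 3: maximum spanning tree (Kruskal-style: greedily add the heaviest edges
    # that do not create a cycle, tracked with union-find).
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree_edges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree_edges.append((i, j))
    # Step 4: orient edges by directing them away from variable 0.
    adj = {v: [] for v in range(n_vars)}
    for i, j in tree_edges:
        adj[i].append(j)
        adj[j].append(i)
    directed, visited, stack = [], {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                directed.append((u, v))   # edge u -> v
                stack.append(v)
    return directed

# Example with made-up boolean data: b depends strongly on a, c is independent.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
b = (a ^ (rng.random(1000) < 0.1)).astype(int)
c = rng.integers(0, 2, size=1000)
print(chow_liu_tree(np.column_stack([a, b, c])))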
Chow-Liu algorithm example
Greedy Algorithm to find Max-Spanning Tree
[Figure: example graph with mutual-information edge weights, used to trace the greedy max-spanning-tree construction]
[courtesy A. Singh, C. Guestrin]
Bayes Nets – What You Should Know
• Representation
– Bayes nets represent the joint distribution as a DAG plus conditional distributions
– D-separation lets us decode conditional independence assumptions
• Inference
– NP-hard in general
– For some graphs, closed form inference is feasible
– Approximate methods too, e.g., Monte Carlo methods, …
• Learning
– Easy for known graph, fully observed data (MLE’s, MAP est.)
– EM for partly observed data, known graph
– Learning graph structure: Chow-Liu for tree-structured networks
– Hardest when graph unknown, data incompletely observed