Lecture 17. PGM Probabilistic Inference; PGM Statistical Inference
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
Probabilistic inference on PGMs
Computing marginal and conditional distributions from the joint of a PGM using Bayes rule and marginalisation.
This deck: how to do it efficiently.
Two familiar examples

• Naïve Bayes (frequentist/Bayesian)
   [PGM: class node Y (with parameters θ) pointing to features X1, …, Xd]
   ∗ Pr(Y | X1, …, Xd) = Pr(Y, X1, …, Xd) / Pr(X1, …, Xd) = Pr(Y, X1, …, Xd) / ∑_y Pr(Y = y, X1, …, Xd)
   ∗ Choose most likely class given data

• Data X|θ ~ N(θ, 1) with prior θ ~ N(0, 1) (Bayesian)
   ∗ Given observation X = x, update posterior
   ∗ Pr(θ | X) = Pr(θ, X) / Pr(X) = Pr(θ, X) / ∑_θ Pr(θ, X)

• Joint + Bayes rule + marginalisation ⇒ anything
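To make "joint + Bayes rule + marginalisation" concrete, here is a minimal sketch (my own toy numbers, not from the lecture) that computes a conditional from a small discrete joint table with NumPy:

```python
import numpy as np

# Hypothetical joint table Pr(Y, X1, X2) over binary variables,
# indexed as joint[y, x1, x2]; entries sum to 1.
joint = np.array([[[0.15, 0.05],
                   [0.10, 0.10]],
                  [[0.05, 0.15],
                   [0.10, 0.30]]])

# Query: Pr(Y | X1 = 1).
joint_y_x1 = joint.sum(axis=2)            # marginalise out X2 -> Pr(Y, X1)
numerator = joint_y_x1[:, 1]              # Pr(Y, X1 = 1), one entry per value of Y
posterior = numerator / numerator.sum()   # Bayes rule: divide by Pr(X1 = 1)
print(posterior)                          # [0.333..., 0.667...]
```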
Nuclear power plant
[PGM: HT (high temp) and FG (faulty gauge) are parents of HG (high gauge); HG and FA (faulty alarm) are parents of AS (alarm sounds)]

• Alarm sounds; meltdown?!
• Pr(HT | AS = t) = Pr(HT, AS = t) / Pr(AS = t)
     = ∑_{FG,HG,FA} Pr(AS = t, FA, HG, FG, HT) / ∑_{FG,HG,FA,HT'} Pr(AS = t, FA, HG, FG, HT')

• Numerator (denominator similar)
     = ∑_FG ∑_HG ∑_FA Pr(HT) Pr(HG | HT, FG) Pr(FG) Pr(AS = t | FA, HG) Pr(FA)
       expanding out the sums: summing once over the full 2^5 joint table
     = Pr(HT) ∑_FG Pr(FG) ∑_HG Pr(HG | HT, FG) ∑_FA Pr(FA) Pr(AS = t | FA, HG)
       distributing the sums as far down as possible: summing over several smaller tables
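To see the saving numerically, here is a sketch with made-up CPTs for the five binary variables (the numbers are illustrative only). Both routes compute the same numerator Pr(HT, AS = t), but the first materialises the full 2^5 joint while the second only ever touches small tables.

```python
import numpy as np

# Hypothetical CPTs for the five binary variables (all numbers made up).
p_ht = np.array([0.95, 0.05])               # Pr(HT)
p_fg = np.array([0.90, 0.10])               # Pr(FG)
p_fa = np.array([0.85, 0.15])               # Pr(FA)
p_hg = np.empty((2, 2, 2))                  # Pr(HG | HT, FG), indexed [HT, FG, HG]
p_hg[..., 1] = [[0.01, 0.80], [0.90, 0.50]]
p_hg[..., 0] = 1 - p_hg[..., 1]
p_as = np.empty((2, 2, 2))                  # Pr(AS | FA, HG), indexed [FA, HG, AS]
p_as[..., 1] = [[0.00, 0.95], [0.40, 0.60]]
p_as[..., 0] = 1 - p_as[..., 1]

# Route 1: build the full joint Pr(HT, FG, HG, FA, AS), then sum out FG, HG, FA at AS = t.
joint = np.einsum('h,g,hgi,f,fia->hgifa', p_ht, p_fg, p_hg, p_fa, p_as)
num_v1 = joint[..., 1].sum(axis=(1, 2, 3))          # Pr(HT, AS = t)

# Route 2: distribute the sums inwards (innermost sum first).
m_as = p_as[..., 1]                                 # Pr(AS = t | FA, HG)
m_fa = np.einsum('f,fi->i', p_fa, m_as)             # sum out FA
m_hg = np.einsum('hgi,i->hg', p_hg, m_fa)           # sum out HG
m_fg = np.einsum('g,hg->h', p_fg, m_hg)             # sum out FG
num_v2 = p_ht * m_fg                                # Pr(HT, AS = t)

assert np.allclose(num_v1, num_v2)
print(num_v2 / num_v2.sum())                        # Pr(HT | AS = t)
```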
Nuclear power plant (cont.)
= Pr(HT) ∑_FG Pr(FG) ∑_HG Pr(HG | HT, FG) ∑_FA Pr(FA) Pr(AS = t | FA, HG)

   eliminate AS: since AS is observed, really a no-op
= Pr(HT) ∑_FG Pr(FG) ∑_HG Pr(HG | HT, FG) ∑_FA Pr(FA) m_AS(FA, HG)

   eliminate FA: multiplying 1×2 by 2×2
= Pr(HT) ∑_FG Pr(FG) ∑_HG Pr(HG | HT, FG) m_FA(HG)

   eliminate HG: multiplying 2×2×2 by 2×1
= Pr(HT) ∑_FG Pr(FG) m_HG(HT, FG)

   eliminate FG: multiplying 1×2 by 2×2
= Pr(HT) m_FG(HT)

Multiplication of tables, followed by summing, is actually matrix multiplication.

[Figure: worked example of one step, computing m_FA(HG) by multiplying the 1×2 table Pr(FA) = (0.4, 0.6) with the 2×2 table m_AS(FA, HG) (entries 1.0, 0, 0.8, 0.2)]
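In code, eliminating FA is literally a vector-matrix product (the specific numbers and the table orientation below are my guess at the slide's figure):

```python
import numpy as np

m_as = np.array([[1.0, 0.0],     # m_AS(FA, HG) = Pr(AS = t | FA, HG), rows indexed by FA
                 [0.8, 0.2]])
p_fa = np.array([0.4, 0.6])      # Pr(FA), a 1x2 table

m_fa = p_fa @ m_as               # multiply tables, then sum over FA
print(m_fa)                      # m_FA(HG) = [0.88, 0.12]
```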
Elimination algorithm
Eliminate(Graph G, Evidence nodes E, Query nodes Q)

1. Choose node ordering I such that Q appears last
2. Initialise empty list active
3. For each node Xi in G                          (initialise)
   a) Append Pr(Xi | parents(Xi)) to active
4. For each node Xi in E                          (evidence)
   a) Append δ(Xi, xi) to active
5. For each i in I                                (marginalise)
   a) potentials = remove tables referencing Xi from active
   b) Ni = nodes other than Xi referenced by these tables
   c) Table φi(Xi, X_Ni) = product of these tables
   d) Table mi(X_Ni) = ∑_Xi φi(Xi, X_Ni)
   e) Append mi(X_Ni) to active
6. Return Pr(X_Q | X_E = x_E) = φ_Q(X_Q) / ∑_{X_Q} φ_Q(X_Q)   (normalise)
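A minimal Python sketch of the algorithm for networks of binary variables (my own factor representation, not the lecture's reference code): a factor is a named-variable table, and each elimination step multiplies the tables touching a variable and sums it out.

```python
import numpy as np

class Factor:
    """A probability table over named discrete variables."""
    def __init__(self, variables, table):
        self.vars = tuple(variables)
        self.table = np.asarray(table, dtype=float)

def multiply_all(factors):
    """Pointwise product of factors, aligned on shared variable names."""
    out_vars = []
    for f in factors:
        out_vars += [v for v in f.vars if v not in out_vars]
    letter = {v: chr(ord('a') + i) for i, v in enumerate(out_vars)}
    spec = ','.join(''.join(letter[v] for v in f.vars) for f in factors)
    spec += '->' + ''.join(letter[v] for v in out_vars)
    return Factor(out_vars, np.einsum(spec, *[f.table for f in factors]))

def sum_out(factor, var):
    """Marginalise one variable out of a factor."""
    axis = factor.vars.index(var)
    return Factor(factor.vars[:axis] + factor.vars[axis + 1:],
                  factor.table.sum(axis=axis))

def eliminate(cpts, evidence, order):
    """Variable elimination. cpts: list of Factors; evidence: {var: value};
    order: all non-query variables, so the query variables come last."""
    active = list(cpts)
    for var, val in evidence.items():            # evidence as indicator tables
        delta = np.zeros(2)                      # binary variables assumed
        delta[val] = 1.0
        active.append(Factor((var,), delta))
    for var in order:                            # marginalise in the chosen order
        touching = [f for f in active if var in f.vars]
        active = [f for f in active if var not in f.vars]
        active.append(sum_out(multiply_all(touching), var))
    phi = multiply_all(active)                   # remaining tables reference only the query
    return Factor(phi.vars, phi.table / phi.table.sum())   # normalise
```

Reusing the made-up power-plant CPTs from the earlier sketch, `eliminate([Factor(('HT',), p_ht), Factor(('FG',), p_fg), Factor(('FA',), p_fa), Factor(('HT','FG','HG'), p_hg), Factor(('FA','HG','AS'), p_as)], evidence={'AS': 1}, order=['AS', 'FA', 'HG', 'FG'])` returns a factor over HT holding Pr(HT | AS = t).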
Runtime of elimination algorithm
[Figure: the PGM after successive eliminations (AS, then FA, then HG, then FG removed), and the "reconstructed" graph over HT, FG, HG, FA, AS obtained from the process called moralisation]
• Each step of elimination
   ∗ Removes a node
   ∗ Connects the node's remaining neighbours, forming a clique in the "reconstructed" graph
     (cliques are exactly the r.v.'s involved in each sum)
• Time complexity is exponential in the largest clique
• Different elimination orderings produce different cliques
   ∗ Treewidth: minimum over orderings of the largest clique
   ∗ Best possible time complexity is exponential in the treewidth
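A small sketch (my own example) of why the ordering matters: simulate elimination on an undirected (moralised) star graph and track the largest clique formed. Eliminating the hub first connects all leaves into one big clique; eliminating the leaves first keeps every clique at size 2.

```python
def max_clique_during_elimination(edges, order):
    """Simulate elimination on an undirected graph; return the size of the
    largest clique formed (eliminated node plus its current neighbours)."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    largest = 0
    for x in order:
        neighbours = nbrs.pop(x, set())
        largest = max(largest, len(neighbours) + 1)
        # connect the eliminated node's remaining neighbours (fill-in edges)
        for u in neighbours:
            nbrs[u].discard(x)
            nbrs[u] |= neighbours - {u}
    return largest

# Star graph: hub 0 connected to leaves 1..5.
star = [(0, i) for i in range(1, 6)]
print(max_clique_during_elimination(star, order=[0, 1, 2, 3, 4, 5]))  # 6: hub eliminated first
print(max_clique_during_elimination(star, order=[1, 2, 3, 4, 5, 0]))  # 2: leaves eliminated first
```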
Probabilistic inference by simulation
• Exact probabilistic inference can be expensive/impossible
• Can we approximate numerically?
• Idea: sampling methods
∗ Cheaply sample from desired distribution
∗ Approximate distribution by histogram of samples
[Figure: histogram of samples over values 1–10 approximating a distribution]
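A tiny sketch of the idea with a made-up discrete distribution: draw samples, then the normalised histogram approximates the true probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.05, 0.10, 0.20, 0.15, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05])  # made up
samples = rng.choice(10, size=10_000, p=true_p)

hist = np.bincount(samples, minlength=10) / len(samples)
print(np.round(hist, 3))   # close to true_p, and closer as the sample size grows
```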
Monte Carlo approx probabilistic inference

• Algorithm: sample once from joint
   1. Order nodes' parents before children (topological order)
   2. Repeat
      a) For each node Xi
         i.  Index into Pr(Xi | parents(Xi)) with the parents' sampled values
         ii. Sample Xi from this distribution
      b) Together X = (X1, …, Xd) is a sample from the joint

   [Figure: example PGM with nodes numbered 1–5 in topological order]

• Algorithm: sampling from Pr(X_Q | X_E = x_E)
   1. Order nodes' parents before children
   2. Initialise set A empty; Repeat
      a) Sample X from the joint
      b) If X_E = x_E then add X_Q to A
   3. Return: histogram of A, normalising counts via divide by |A|

• Sampling++: Importance weighting, Gibbs, Metropolis-Hastings
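A sketch of both algorithms on the power-plant network, reusing the made-up CPT numbers from the earlier elimination sketch (everything here is illustrative, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up CPTs (same layout as earlier): index 1 means "true".
p_ht, p_fg, p_fa = 0.05, 0.10, 0.15                 # Pr(HT=t), Pr(FG=t), Pr(FA=t)
p_hg_t = np.array([[0.01, 0.80], [0.90, 0.50]])     # Pr(HG=t | HT, FG)
p_as_t = np.array([[0.00, 0.95], [0.40, 0.60]])     # Pr(AS=t | FA, HG)

def sample_joint():
    """Ancestral sampling: visit nodes in topological order, condition each on its parents."""
    ht = rng.random() < p_ht
    fg = rng.random() < p_fg
    fa = rng.random() < p_fa
    hg = rng.random() < p_hg_t[int(ht), int(fg)]
    alarm = rng.random() < p_as_t[int(fa), int(hg)]
    return {'HT': ht, 'FG': fg, 'FA': fa, 'HG': hg, 'AS': alarm}

# Rejection sampling for Pr(HT = t | AS = t): keep only samples consistent with the evidence.
kept = [s['HT'] for s in (sample_joint() for _ in range(100_000)) if s['AS']]
print(len(kept), np.mean(kept))    # number of accepted samples, estimate of Pr(HT=t | AS=t)
```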
Alternate forms of probabilistic inference
• Elimination algorithm produces a single marginal
• Sum-product algorithm on trees
   ∗ 2x cost, supplies all marginals
   ∗ Name: marginalisation is just a sum of products of tables
   ∗ "Identical" variants: max-product, for MAP estimation
• In general these are message-passing algorithms
   ∗ Can generalise beyond trees (beyond scope): junction tree algorithm, loopy belief propagation
• Variational Bayes: approximation via optimisation
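For intuition, here is a minimal sketch of sum-product message passing on a three-node chain X1 → X2 → X3 with binary variables and evidence X3 = 1 (my own toy tables, not from the lecture): one forward and one backward sweep yield every node's conditional marginal.

```python
import numpy as np

prior1  = np.array([0.6, 0.4])                  # Pr(X1)
trans12 = np.array([[0.7, 0.3], [0.2, 0.8]])    # Pr(X2 | X1), rows indexed by X1
trans23 = np.array([[0.9, 0.1], [0.5, 0.5]])    # Pr(X3 | X2), rows indexed by X2

# Forward message (sum over X1 of the product of tables).
fwd2 = prior1 @ trans12                         # message into X2

# Backward messages, starting from the evidence X3 = 1.
bwd2 = trans23[:, 1]                            # message into X2: Pr(X3 = 1 | X2)
bwd1 = trans12 @ bwd2                           # message into X1 (sum over X2)

# Each marginal is the normalised product of its incoming messages and local factor.
p_x2 = fwd2 * bwd2;   p_x2 /= p_x2.sum()        # Pr(X2 | X3 = 1)
p_x1 = prior1 * bwd1; p_x1 /= p_x1.sum()        # Pr(X1 | X3 = 1)
print(p_x1, p_x2)
```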
Statistical inference on PGMs
Learning from data: fitting probability tables to observations (e.g. as a frequentist; a Bayesian would just use probabilistic inference to update the prior to a posterior)
Where are we?
• Representation of joint distributions
   ∗ PGMs encode conditional independence
• Independence, d-separation
• Probabilistic inference
   ∗ Computing other distributions from the joint
   ∗ Elimination, sampling algorithms
• Statistical inference
∗ Learn parameters from data
Have PGM, Some observations, No tables…
[Figure: the power-plant PGM drawn as a plate over observations i = 1..n (nodes HTi, FGi, HGi, FAi, ASi), with a conditional probability table attached to each node whose entries are all unknown ("?")]
Fully-observed case is “easy”
• Max-Likelihood Estimator (MLE) says
   ∗ If we observe all r.v.'s X in a PGM, independently n times as x_i
   ∗ Then maximise the full joint
     argmax_{θ∈Θ} ∏_{i=1}^{n} ∏_j p(X_j = x_j^{(i)} | X_parents(j) = x_parents(j)^{(i)})
• Decomposes easily, leads to counts-based estimates
   ∗ Maximise the log-likelihood instead; becomes a sum of logs
     argmax_{θ∈Θ} ∑_{i=1}^{n} ∑_j log p(X_j = x_j^{(i)} | X_parents(j) = x_parents(j)^{(i)})
   ∗ Big maximisation of all parameters together decouples into small independent problems
• Example is training a naïve Bayes classifier

[Figure: the power-plant PGM as a plate over observations i = 1..n, with all nodes observed]
Example: Fully-observed case
Fill each table entry with the corresponding relative frequency in the data, e.g.:

   Pr(FG = false) ≈ #{x_i : FG_i = false} / n
   Pr(FG = true) ≈ #{x_i : FG_i = true} / n
   Pr(HG = true | HT = false, FG = false) ≈ #{x_i : HG_i = true, HT_i = false, FG_i = false} / #{x_i : HT_i = false, FG_i = false}

[Figure: the plate model over i = 1..n with its conditional probability tables, arrows indicating which count ratio fills which table entry]
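A sketch of these counts on synthetic fully-observed data (the generating parameters and the subset of variables are made up for illustration): each CPT entry is a relative frequency among the rows matching its parent configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Synthetic fully-observed data for HT, FG, HG (generating parameters are made up).
ht = rng.random(n) < 0.05
fg = rng.random(n) < 0.10
p_hg_t = np.array([[0.01, 0.80], [0.90, 0.50]])          # Pr(HG=t | HT, FG)
hg = rng.random(n) < p_hg_t[ht.astype(int), fg.astype(int)]

# MLE of Pr(FG = true): a plain relative frequency.
print(fg.mean())                                          # close to 0.10

# MLE of Pr(HG = true | HT = false, FG = false): relative frequency within that slice.
mask = ~ht & ~fg
print((hg & mask).sum() / mask.sum())                     # close to 0.01
```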
Presence of unobserved variables trickier
• But most PGMs you'll encounter will have latent, or unobserved, variables
• What happens to the MLE?
   ∗ Maximise the likelihood of the observed data only
   ∗ Marginalise the full joint to get to the desired "partial" joint
     argmax_{θ∈Θ} ∏_{i=1}^{n} ∑_latent ∏_j p(X_j = x_j^{(i)} | X_parents(j) = x_parents(j)^{(i)})
   ∗ This won't decouple; oh-no's!! Use the EM algorithm!

[Figure: the power-plant plate model over i = 1..n with some nodes unobserved]
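EM is covered properly later; as a rough preview of how it handles this coupling, here is a minimal sketch of EM for a toy two-component 1-D Gaussian mixture with unit variances (entirely my own example, not from the slides), alternating soft assignments of the latent component (E-step) with re-estimation of the parameters (M-step):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two clusters, but the component label of each point is latent.
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

w, mu = 0.5, np.array([-1.0, 1.0])   # mixing weight of component 1, and the two means

for _ in range(50):
    # E-step: posterior responsibility of component 1 for each point (unit variances).
    lik0 = (1 - w) * np.exp(-0.5 * (data - mu[0]) ** 2)
    lik1 = w * np.exp(-0.5 * (data - mu[1]) ** 2)
    r1 = lik1 / (lik0 + lik1)
    # M-step: re-estimate the weight and the means from the soft assignments.
    w = r1.mean()
    mu = np.array([np.sum((1 - r1) * data) / np.sum(1 - r1),
                   np.sum(r1 * data) / np.sum(r1)])

print(w, mu)   # roughly 0.7, with means near -2 and 3
```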
Summary
• Probabilistic inference on PGMs
∗ What is it and why do we care?
∗ Elimination algorithm; complexity via cliques
∗ Monte Carlo approaches as an alternative to exact integration
• Statistical inference on PGMs
∗ What is it and why do we care?
∗ Straight MLE for fully-observed data
∗ EM algorithm for mixed latent/observed data
• Next time: extra (some more on HMMs, message passing, etc.)