
Lecture 16. PGM Representation
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne

Next Lectures
• Representation of joint distributions
• Conditional/marginal independence
∗ Directed vs undirected
• Probabilistic inference
∗ Computing other distributions from joint
• Statistical inference
∗ Learn parameters from (missing) data
2

Probabilistic Graphical Models
Marriage of graph theory and probability theory. Tool of choice for Bayesian statistical learning.
We'll stick with the easier discrete case; the ideas generalise to continuous r.v.'s.
3

Motivation by practical importance
• Many applications
∗ Phylogenetic trees
∗ Pedigrees, linkage analysis
∗ Error-control codes
∗ Speech recognition
∗ Document topic models
∗ Probabilistic parsing
∗ Image segmentation
∗ …

• Unifies many independently discovered algorithms
∗ HMMs
∗ Kalman filters
∗ Mixture models
∗ LDA
∗ MRFs
∗ CRFs
∗ Logistic, linear regression
∗ …
4

Motivation by way of comparison
Bayesian statistical learning
• Model joint distribution of X's, Y and parameter r.v.'s
∗ "Priors": marginals on parameters
• Training: update prior to posterior using observed data
• Prediction: output posterior, or some function of it (MAP)

PGMs aka “Bayes Nets”
• Efficient joint representation
∗ Independence made explicit
∗ Trade-off between expressiveness and need for data, easy to make
∗ Easy for practitioners to model
• Algorithms to fit parameters, compute marginals, posteriors

5

Everything Starts at the Joint Distribution
• All joint distributions on discrete r.v.'s can be represented as tables
• #rows grows exponentially with #r.v.'s
• Example: Truth Tables
∗ M Boolean r.v.'s require a table with 2^M rows (so 2^M − 1 free parameters)
∗ Table assigns a probability per row
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 ?
[Figure: Venn-style diagram of events A, B, C annotated with the same probabilities as the table]
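To make the "joint as a table" idea concrete, here is a minimal Python sketch (my own illustration, not lecture code) that stores this joint literally and recovers the missing "?" entry from the constraint that the probabilities sum to 1:

# Joint over Boolean r.v.'s A, B, C as a literal table (values from the slide).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25,
}
# The final row is determined by the rest: all probabilities must sum to 1.
joint[(1, 1, 1)] = 1.0 - sum(joint.values())
print(round(joint[(1, 1, 1)], 2))   # 0.1

# With M Boolean r.v.'s the table has 2**M rows -- exponential growth.
assert len(joint) == 2 ** 3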
6

The Good: What we can do with the joint
• Probabilistic inference from joint on r.v.'s
∗ Computing any other distributions involving our r.v.'s
• Pattern: want a distribution, have the joint; use:
Bayes rule + marginalisation
• Example: naïve Bayes classifier
∗ Predict class y of instance x by maximising
  Pr(Y=y | X=x) = Pr(Y=y, X=x) / Pr(X=x) = Pr(Y=y, X=x) / ∑_{y'} Pr(X=x, Y=y')
Recall: integration (over parameters) is the continuous equivalent of summation (both are referred to as marginalisation)
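Continuing the toy joint from the previous slide, a small sketch (again my own illustration) of the "Bayes rule + marginalisation" pattern, e.g. computing Pr(A | C = 1) by summing out the other variables:

# Same toy joint over (A, B, C) as on the previous slide, with the last entry filled in.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def pr_A_given_C(a, c):
    """Pr(A=a | C=c) = Pr(A=a, C=c) / Pr(C=c), both obtained by marginalising."""
    num = sum(p for (ai, b, ci), p in joint.items() if ai == a and ci == c)   # Pr(A=a, C=c)
    den = sum(p for (ai, b, ci), p in joint.items() if ci == c)               # Pr(C=c)
    return num / den

print(pr_A_given_C(1, 1))   # Pr(A=1 | C=1) = 0.20 / 0.30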
7

The Bad & Ugly: Tables waaaaay too large!!
• The Bad: Computational complexity
∗ Tables have an exponential number of rows in the number of r.v.'s
∗ Therefore poor space & time complexity to marginalise
• The Ugly: Model complexity
∗ Way too flexible
∗ Way too many parameters to fit → need lots of data OR will overfit
• Antidote: assume independence!
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 ?
8

Example: You’re late!
• Modelling a tardy lecturer. Boolean r.v.'s:
∗ T: Ben teaches the class
∗ S: It is sunny (o.w. bad weather)
∗ L: The lecturer arrives late (o.w. on time)
• Assume: Ben sometimes delayed by bad weather, Ben more likely late than other lecturers
∗ Pr(S|T) = Pr(S), Pr(S) = 0.3, Pr(T) = 0.6
Pr(L = true | T, S):
             T = False   T = True
S = False       0.1         0.2
S = True        0.05        0.1
• Lateness not independent of weather, lecturer
∗ Need Pr(L | T = t, S = s) for all combinations
• Need just 6 parameters
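A quick illustrative encoding of those 6 parameters (variable names are mine; the numbers are from this slide):

# Illustrative encoding of the 6 parameters: Pr(S)=0.3, Pr(T)=0.6,
# and the 2x2 table Pr(L = true | T, S) above.
pr_sunny = 0.3                    # Pr(S = true)
pr_ben_teaches = 0.6              # Pr(T = true)
pr_late = {                       # Pr(L = true | T = t, S = s), keyed by (t, s)
    (False, False): 0.1, (True, False): 0.2,
    (False, True): 0.05, (True, True): 0.1,
}

# 1 + 1 + 4 = 6 parameters, versus 2^3 - 1 = 7 for an unrestricted joint over (L, T, S).
assert 2 + len(pr_late) == 6

print(pr_late[(True, True)])      # Ben teaching on a sunny day is late with prob 0.1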
9

Independence: not a dirty word
• Independence assumptions
∗ Can be reasonable in light of domain expertise
∗ Allow us to factor → key to tractable models

Lazy Lecturer Model                  Model details                     # params
Our model with S, T independence     Pr(S,T) factors to Pr(S) Pr(T)    2
                                     Pr(L|T,S) modelled in full        4
Assumption-free model                Pr(L,T,S) modelled in full        7
10

Factoring Joint Distributions
• Chain Rule: for any ordering of r.v.'s we can always factor:
  Pr(X_1, X_2, …, X_k) = ∏_{i=1}^{k} Pr(X_i | X_{i+1}, …, X_k)
• Model's independence assumptions correspond to dropping conditioning r.v.'s in the factors!
– Example unconditional indep.: Pr(X_1 | X_2) = Pr(X_1)
– Example conditional indep.: Pr(X_1 | X_2, X_3) = Pr(X_1 | X_2)
• Example: independent r.v.'s: Pr(X_1, …, X_k) = ∏_{i=1}^{k} Pr(X_i)
• Simpler factors: speed up inference and avoid overfitting
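As a worked instance (my own addition, but consistent with the lecturer example and parameter counts on the earlier slides), in LaTeX:

\Pr(L, T, S) = \Pr(L \mid T, S)\,\Pr(T \mid S)\,\Pr(S)   % chain rule
             = \Pr(L \mid T, S)\,\Pr(T)\,\Pr(S)          % drop S: assumes T independent of S
% Parameter count: 4 + 1 + 1 = 6, versus 2^3 - 1 = 7 for the unfactored joint.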
11

Directed PGM
• Nodes: random variables
∗ Node table: Pr(child | parents)
• Edges (acyclic): conditional dependence
∗ Child directly depends on parents
• Joint factorisation
  Pr(X_1, X_2, …, X_k) = ∏_{i=1}^{k} Pr(X_i | X_j ∈ parents(X_i))

Tardy Lecturer Example: nodes S, T, L with edges S → L and T → L; tables Pr(S), Pr(T), Pr(L | S, T)
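A minimal Python sketch (illustrative only; the data structures and names are mine) of evaluating this factorisation for the tardy lecturer graph, multiplying each node's conditional given its parents:

parents = {"S": [], "T": [], "L": ["S", "T"]}
cpts = {
    "S": {(): {True: 0.3, False: 0.7}},
    "T": {(): {True: 0.6, False: 0.4}},
    "L": {(s, t): {True: p, False: 1 - p}
          for (s, t), p in {(False, False): 0.1, (False, True): 0.2,
                            (True, False): 0.05, (True, True): 0.1}.items()},
}

def joint_prob(assignment):
    """Pr(assignment) = product over nodes of Pr(node value | parent values)."""
    prob = 1.0
    for node, value in assignment.items():
        parent_vals = tuple(assignment[p] for p in parents[node])
        prob *= cpts[node][parent_vals][value]
    return prob

print(joint_prob({"S": True, "T": True, "L": False}))   # 0.3 * 0.6 * (1 - 0.1)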
12

Example: Nuclear power plant
• Scenario: a core temperature is monitored by a temperature gauge that drives an alarm
• Model uncertainty in monitoring failure
∗ CTL: core temperature low
∗ FG: faulty gauge
∗ GRL: gauge reads low
∗ FA: faulty alarm
∗ AS: alarm sounds
• PGMs to the rescue!

[Figure: DAG with CTL and FG as parents of GRL, and GRL and FA as parents of AS; node tables Pr(CTL), Pr(FG), Pr(FA), Pr(GRL|CTL,FG), Pr(AS|GRL,FA)]

Joint: Pr(CTL, FG, FA, GRL, AS) = Pr(AS|FA,GRL) Pr(FA) Pr(GRL|CTL,FG) Pr(CTL) Pr(FG)
13

Naïve Bayes
[Figure: directed PGM with class node Y and children X1, …, Xd]

Model: Y ~ Bernoulli(θ), X_j | Y ~ Bernoulli(θ_{j,Y})
Aside: Bernoulli is just Binomial with count = 1

Pr(Y, X_1, …, X_d) = Pr(X_1, …, X_d, Y)
                   = Pr(X_1 | Y) Pr(X_2 | X_1, Y) … Pr(X_d | X_1, …, X_{d−1}, Y) Pr(Y)   [chain rule]
                   = Pr(X_1 | Y) Pr(X_2 | Y) … Pr(X_d | Y) Pr(Y)                         [conditional independence given Y]

Prediction: predict label maximising Pr(Y | X_1, …, X_d)
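A small Python sketch of prediction via this factorised joint (my own example with made-up parameter values; theta[y][j] plays the role of θ_{j,y} = Pr(X_j = 1 | Y = y)):

# Hypothetical Bernoulli naive Bayes with d = 3 features.
prior = {0: 0.4, 1: 0.6}                        # Pr(Y = y)
theta = {0: [0.2, 0.5, 0.7], 1: [0.8, 0.5, 0.1]}

def joint(y, x):
    """Pr(Y=y, X=x) = Pr(Y=y) * prod_j Pr(X_j = x_j | Y=y)."""
    p = prior[y]
    for j, xj in enumerate(x):
        p *= theta[y][j] if xj == 1 else 1 - theta[y][j]
    return p

def predict(x):
    """argmax_y Pr(Y=y | X=x); the evidence Pr(X=x) cancels in the argmax."""
    return max(prior, key=lambda y: joint(y, x))

print(predict([1, 0, 0]))   # -> 1 for these illustrative parameters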
14

Short-hand for repeats: Plate notation
[Figure: the explicit chain of nodes X1 … Xd is drawn equivalently as a single node Xi inside a plate labelled i = 1..d; similarly, Y with children X1 … Xd equals Y with child Xi inside the plate]
15

PGMs frequentist OR Bayesian…
• PGMs represent joints, which are central to Bayesian statistical learning
• Catch is that Bayesians add: a node per parameter, with its table being the parameter's prior

[Figure: naïve Bayes graph augmented with parameter nodes: θ as parent of Y, and θ_{j,0}, θ_{j,1} as parents of each X_j]

Y ~ Bernoulli(θ)
X_j | Y ~ Bernoulli(θ_{j,Y})
θ's ~ Beta
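A generative sketch of this Bayesian treatment (illustrative only; the Beta(1, 1) hyperparameters and names are my own choices), drawing each parameter from its prior before sampling data:

import random

d = 3  # number of features (illustrative)

# Parameter nodes: one prior draw per parameter, as in the augmented graph above.
theta_y = random.betavariate(1, 1)                                        # Pr(Y = 1)
theta_x = {(j, y): random.betavariate(1, 1) for j in range(d) for y in (0, 1)}  # Pr(X_j = 1 | Y = y)

def sample_datapoint():
    """Y ~ Bernoulli(theta_y); X_j | Y ~ Bernoulli(theta_{j,Y})."""
    y = int(random.random() < theta_y)
    x = [int(random.random() < theta_x[(j, y)]) for j in range(d)]
    return y, x

print(sample_datapoint())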
16

Undirected PGMs
Undirected variant of PGM, parameterised by arbitrary positive valued functions of the variables, and global normalisation.
A.k.a. Markov Random Field.
17

Undirected vs directed
Undirected PGM
• Graph
∗ Edges undirected
∗ Each node a r.v.
• Probability
∗ Each clique C has "factor" ψ_C(X_j : j ∈ C) ≥ 0
∗ Joint ∝ product of factors

Directed PGM
• Graph
∗ Edges directed
∗ Each node a r.v.
• Probability
∗ Each node has conditional p(X_i | X_j ∈ parents(X_i))
∗ Joint = product of conditionals

Key difference = normalisation
18

Undirected PGM formulation
• Based on notion of
∗ Clique: a set of fully connected nodes (e.g., A-D, C-D, C-D-F)
∗ Maximal clique: largest cliques in graph (not C-D, due to C-D-F)

[Figure: example undirected graph over nodes A, B, C, D, E, F illustrating these cliques]

• Joint probability defined as
  Pr(x_1, …, x_n) = (1/Z) ∏_C ψ_C(x_C)
∗ where each ψ_C is a positive function of its clique's variables and Z is the normalising 'partition' function
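A brute-force Python sketch (my own toy example, not lecture code) for a tiny binary MRF with cliques {A, B} and {B, C}, showing where the partition function Z comes from:

from itertools import product

# Factor values are arbitrary positive numbers chosen for illustration.
psi_ab = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_bc = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 1.0}

def unnormalised(a, b, c):
    """Product of clique factors: proportional to the joint."""
    return psi_ab[(a, b)] * psi_bc[(b, c)]

# Partition function: sum of the unnormalised measure over all assignments
# (exponentially many in general -- the source of the intractability noted later).
Z = sum(unnormalised(a, b, c) for a, b, c in product((0, 1), repeat=3))

def joint(a, b, c):
    return unnormalised(a, b, c) / Z

print(Z, joint(1, 1, 1))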
19

Directed to undirected
• Directed PGM formulated as Pr(x_1, …, x_k) = ∏_{i=1}^{k} Pr(x_i | x_{π_i}), where π_i indexes the parents of node i
• Equivalent to U-PGM with
∗ each conditional probability term included in one factor function, ψ_c
∗ clique structure linking each node with its parents
∗ normalisation term trivial, Z = 1
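In symbols, a short sketch of the correspondence (my own rendering in LaTeX):

% One factor per node i, on the clique c_i = \{x_i\} \cup x_{\pi_i}:
\psi_{c_i}(x_i, x_{\pi_i}) = \Pr(x_i \mid x_{\pi_i})
% The product of factors is then already normalised:
Z = \sum_{x} \prod_{i=1}^{k} \Pr(x_i \mid x_{\pi_i}) = 1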
20

1. copy nodes
2. copy edges, undirected
3. 'moralise' parent nodes (connect parents that share a child)

[Figure: the nuclear power plant graph over CTL, FG, FA, GRL, AS, shown directed and then moralised/undirected]
21

Why U-PGM?
• Pros
∗ generalisation of D-PGM
∗ simpler means of modelling without the need for per- factor normalisation
∗ general inference algorithms use U-PGM representation (supporting both types of PGM)
• Cons
∗ (slightly) weaker independence
∗ calculating global normalisation term (Z) intractable in
general (but tractable for chains/trees, e.g., CRFs)
22

Example PGMs
The hidden Markov model (HMM); lattice Markov random field (MRF);
Conditional random field (CRF)
23

The HMM (and Kalman Filter)
• Sequential observed outputs from hidden state
• The Kalman filter: the same with continuous Gaussian r.v.'s
• A CRF is the undirected analogue

[Figure: chain-structured directed PGM with hidden states q1, q2, q3, q4 and observed outputs o1, o2, o3, o4]
24

HMM Applications
• NLP – part of speech tagging: given words in sentence,
infer hidden parts of speech
"I love Machine Learning" → noun, verb, noun, noun
• Speech recognition: given waveform, determine phonemes
• Biological sequences: classification, search, alignment
• Computer vision: identify who’s walking in video, tracking
25

Fundamental HMM Tasks
HMM Task                                                        PGM Task
Evaluation. Given an HMM μ and observation sequence O,          Probabilistic inference
  determine likelihood Pr(O|μ)
Decoding. Given an HMM μ and observation sequence O,            MAP point estimate
  determine most probable hidden state sequence Q
Learning. Given an observation sequence O and set of states,    Statistical inference
  learn parameters A, B, Π
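For the Evaluation task, a minimal forward-algorithm sketch in Python (my own illustration; A is the transition matrix, B the emission matrix, Pi the initial state distribution, matching the parameter names in the table above):

import numpy as np

def forward_likelihood(A, B, Pi, O):
    """Evaluation: Pr(O | mu) for an HMM mu = (A, B, Pi) via the forward algorithm.

    A[i, j] = Pr(q_{t+1} = j | q_t = i), B[i, k] = Pr(o_t = k | q_t = i), Pi[i] = Pr(q_1 = i).
    """
    alpha = Pi * B[:, O[0]]                 # alpha_1(i) = Pr(o_1, q_1 = i)
    for o_t in O[1:]:
        alpha = (alpha @ A) * B[:, o_t]     # alpha_t(j) = sum_i alpha_{t-1}(i) A[i, j] B[j, o_t]
    return alpha.sum()                      # Pr(O | mu) = sum_i alpha_T(i)

# Tiny 2-state, 2-symbol HMM with made-up parameters:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
Pi = np.array([0.5, 0.5])
print(forward_likelihood(A, B, Pi, [0, 1, 0]))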
26

Pixel labelling tasks in Computer Vision
Interactive figure-ground segmentation (Boykov & Jolly 2011)
Semantic labelling (Gould et al. 09)
Denoising (Felzenszwalb & Huttenlocher 04)
27

What these tasks have in common
• Hidden state representing semantics of image
∗ Semantic labelling: Cow vs. tree vs. grass vs. sky vs. house
∗ Fore-back segmentation: figure vs. ground
∗ Denoising: clean pixels
• Pixels of image
∗ What we observe of hidden state
• Remind you of HMMs?
28

A hidden square-lattice Markov random field
• Hidden states: square-lattice model
∗ Boolean for two-class states
∗ Discrete for multi-class
∗ Continuous for denoising
• Pixels: observed outputs
∗ Continuous, e.g. Normal

[Figure: square-lattice MRF with hidden states q_ij, each emitting an observed pixel o_ij]
29

Application to sequences: CRFs
• Conditional Random Field: same model applied to sequences
∗ observed outputs are words, speech, amino acids, etc.
∗ states are tags: part-of-speech, phone, alignment, …
• CRFs are discriminative, modelling P(Q|O)
∗ versus HMMs, which are generative, modelling P(Q,O)
∗ undirected PGM more general and expressive

[Figure: chain-structured undirected PGM over states q1, q2, q3, q4 with observations o1, o2, o3, o4]
30

Summary
• Probabilistic graphical models
∗ Motivation: applications, unifies algorithms
∗ Motivation: ideal tool for Bayesians
∗ Independence lowers computational/model complexity
∗ PGMs: compact representation of factorised joints
∗ U-PGMs
• Example PGMs and applications
• Next time: elimination for probabilistic inference
31