COMS 4771 Probabilistic Reasoning via Graphical Models
Nakul Verma
Last time…
• Dimensionality Reduction
Linear vs non-linear Dimensionality Reduction
• Principal Component Analysis (PCA)
• Non-linear methods for dimensionality reduction
Graphical Models
A probabilistic model where a graph represents the conditional dependence structure among the variables.
Provides a compact representation of the joint distribution!
Example:
Four variables of interest – cloudiness, raining, sprinkler, grass_wet
Possible causality structure (a directed graph):
Cloudy (C) → Sprinkler (S), Cloudy (C) → Rainy (R), Sprinkler (S) → GrassWet (G), Rainy (R) → GrassWet (G)

Inference questions:
• What is the probability it rained given the grass is wet?
• What is the chance that the sprinkler was off given the grass is wet and it is not cloudy?

Learning questions:
• What is the most likely GM structure and connection weights that model the data?
Graphical Models: Representation
There are two kinds of Graphical Models:

Directed models – Bayesian Networks
(e.g., the network C → S, C → R, S → G, R → G from the previous slide)
Edge direction typically denotes potential causality

Undirected models – Markov Random Fields (MRFs)
Edge connection typically denotes potential co-occurrence
Bayesian Networks
What is the joint probability for these variables?

Chain rule:
P(C, S, R, G) = P(C) P(S|C) P(R|C,S) P(G|C,S,R)

In general:
P(X1, …, Xn) = Π_i P(Xi | parents(Xi))
due to the (in)dependencies asserted by the parent-child relationships.
That is: a variable is independent of its ancestors given its parents.

For this network: P(C, S, R, G) = P(C) P(S|C) P(R|C) P(G|S,R)
Bayesian Networks: Inference
P(C=1) = 0.5

C | P(S=1|C)
0 | 0.5
1 | 0.1

C | P(R=1|C)
0 | 0.2
1 | 0.8

S R | P(G=1|S,R)
0 0 | 0.0
0 1 | 0.9
1 0 | 0.9
1 1 | 0.99
These conditional probability tables (CPT) are enough to completely specify the joint distribution!
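To make this concrete, here is a small Python sketch (not from the lecture) that encodes the CPTs above and recovers the full joint by multiplying the factors P(C) P(S|C) P(R|C) P(G|S,R):

import itertools

# CPTs from the slide, stored as the probability that each variable equals 1
p_C1 = 0.5
p_S1_given_C = {0: 0.5, 1: 0.1}              # P(S=1 | C)
p_R1_given_C = {0: 0.2, 1: 0.8}              # P(R=1 | C)
p_G1_given_SR = {(0, 0): 0.0, (0, 1): 0.9,
                 (1, 0): 0.9, (1, 1): 0.99}  # P(G=1 | S, R)

def bern(p1, value):
    """Probability of a binary value under P(value = 1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(c, s, r, g):
    """P(C=c, S=s, R=r, G=g) = P(c) P(s|c) P(r|c) P(g|s,r)."""
    return (bern(p_C1, c) * bern(p_S1_given_C[c], s)
            * bern(p_R1_given_C[c], r) * bern(p_G1_given_SR[(s, r)], g))

# Sanity check: the joint sums to 1 over all 16 configurations
total = sum(joint(*cfg) for cfg in itertools.product([0, 1], repeat=4))
print(f"sum of joint = {total:.4f}")

# Marginal probability that the grass is wet
p_G1 = sum(joint(c, s, r, 1) for c, s, r in itertools.product([0, 1], repeat=3))
print(f"P(G=1) = {p_G1:.4f}")

The sum-to-one check confirms that the four CPTs really do specify a valid joint distribution over the 16 possible configurations.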
Bayesian Networks: Inference
Q: What is the probability of the sprinkler being on given the grass is wet?

By Bayes' rule, marginalizing over the unobserved variables C and R:
P(S=1 | G=1) = P(S=1, G=1) / P(G=1)
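Since there are only four binary variables, this query can be answered exactly by brute-force enumeration over the joint. A sketch, reusing the CPT values from the previous slide:

import itertools

# Same CPTs as before, written as P(variable = 1 | parents)
p_C1 = 0.5
p_S1 = {0: 0.5, 1: 0.1}                                        # keyed by C
p_R1 = {0: 0.2, 1: 0.8}                                        # keyed by C
p_G1 = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}   # keyed by (S, R)

def bern(p1, v):
    return p1 if v == 1 else 1.0 - p1

def joint(c, s, r, g):
    return bern(p_C1, c) * bern(p_S1[c], s) * bern(p_R1[c], r) * bern(p_G1[(s, r)], g)

# P(S=1 | G=1) = P(S=1, G=1) / P(G=1): marginalize out C and R
num = sum(joint(c, 1, r, 1) for c, r in itertools.product([0, 1], repeat=2))
den = sum(joint(c, s, r, 1) for c, s, r in itertools.product([0, 1], repeat=3))
print(f"P(S=1 | G=1) = {num / den:.4f}")

With these CPTs the answer comes out to roughly 0.43.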
Bayesian Networks: Learning Parameters
Learning the parameters knowing the structure,
i.e., estimating the CPTs from observations.

Simply use the maximum likelihood estimates (i.e., counts), for example:
P(S=1 | C=0) ≈ count(S=1, C=0) / count(C=0)
etc.

Issue: this assigns zero probability to combinations unseen in the data. How to fix that?
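One standard fix is Laplace (add-one) smoothing of the counts. A minimal sketch of the counting estimate for one CPT entry, using a small hypothetical dataset of (C, S, R, G) observations:

import numpy as np

# Hypothetical data: each row is one observation of (C, S, R, G), all binary
data = np.array([
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])

def estimate_p_S1_given_C(data, c, alpha=0.0):
    """Estimate P(S=1 | C=c) by counting; alpha > 0 gives Laplace smoothing."""
    rows = data[data[:, 0] == c]            # observations with C = c
    count_S1 = np.sum(rows[:, 1] == 1)
    # add-alpha smoothing over the 2 possible values of S
    return (count_S1 + alpha) / (len(rows) + 2 * alpha)

print("MLE      P(S=1 | C=1):", estimate_p_S1_given_C(data, c=1))             # 0.0
print("Smoothed P(S=1 | C=1):", estimate_p_S1_given_C(data, c=1, alpha=1.0))  # 0.2

With this made-up dataset the raw count estimate is exactly zero, while the smoothed estimate stays strictly positive.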
Bayesian Networks: Learning Structure
Learning the unknown structure between the variables
General approach:
• Test conditional independencies in the data
• Grow-Shrink Markov Blanket algorithm

With an assumed structure:
• Tree structure: Chow-Liu algorithm (sketched below)
• Small cliques: variations on Chow-Liu

It is NP-hard to find the optimal structure in general.
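For the tree-structured case, here is a rough sketch of the Chow-Liu idea under the assumption of binary data: estimate pairwise mutual information from counts and keep a maximum-weight spanning tree. The dataset below is synthetic and purely illustrative.

import itertools
import numpy as np

def mutual_information(xi, xj):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a, b in itertools.product([0, 1], repeat=2):
        p_ab = np.mean((xi == a) & (xj == b))
        p_a, p_b = np.mean(xi == a), np.mean(xj == b)
        if p_ab > 0:
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Return the edges of a maximum-weight spanning tree under pairwise MI."""
    n_vars = data.shape[1]
    # All candidate edges, sorted by decreasing mutual information
    edges = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                    for i, j in itertools.combinations(range(n_vars), 2)),
                   reverse=True)
    # Kruskal's algorithm with a simple union-find
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Synthetic binary dataset with 4 variables, loosely mimicking (C, S, R, G)
rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=1000)
s = (rng.random(1000) < np.where(c == 1, 0.1, 0.5)).astype(int)
r = (rng.random(1000) < np.where(c == 1, 0.8, 0.2)).astype(int)
g = ((s | r) & (rng.random(1000) < 0.9).astype(int)).astype(int)
print("Chow-Liu tree edges:", chow_liu_tree(np.column_stack([c, s, r, g])))

The spanning tree gives only the structure; the CPTs along its edges would then be estimated by counting, as on the previous slide.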
Markov Random Fields (MRFs)
Graphical models with undirected connections. The joint distribution factorizes over the cliques of the graph:

P(X1, …, Xn) = (1/Z) Π_C ψ_C(X_C)

Z: normalizer (so things integrate to 1), aka the partition function
ψ_C: clique potentials, typically the relative frequency of variable co-occurrence in a clique

Example: a five-variable graph over X1, …, X5 [figure not reproduced]
What are the max-cliques?
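Since the five-variable figure is not reproduced here, the following sketch uses an assumed three-node chain X1 – X2 – X3 with made-up pairwise potentials, just to show how clique potentials and the partition function Z fit together:

import itertools
import numpy as np

# Hypothetical chain X1 - X2 - X3, binary variables, one potential per edge
psi_12 = np.array([[2.0, 1.0],
                   [1.0, 3.0]])   # psi(x1, x2)
psi_23 = np.array([[1.0, 2.0],
                   [2.0, 1.0]])   # psi(x2, x3)

def unnormalized(x1, x2, x3):
    """Product of clique potentials for one configuration."""
    return psi_12[x1, x2] * psi_23[x2, x3]

# Partition function Z: sum of unnormalized scores over all configurations
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))
print("Z =", Z)
print("P(X1=1, X2=1, X3=0) =", unnormalized(1, 1, 0) / Z)

In this chain the max-cliques are just the edges {X1, X2} and {X2, X3}; for a denser graph the potentials would be defined over larger cliques.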
A Closer Look at (In)dependencies in GMs
What are the (conditional) independencies asserted by the following graphical models?
(directed)
[Figure: three-variable directed graphs over X1, X2, X3 with different edge orientations]

(undirected)
[Figure: three-variable undirected graphs over X1, X2, X3 with different edge sets]
Relation Between Directed & Undirected GM
What are the (conditional) independencies asserted by the following directed model?
[Figure: a directed graphical model over X1, …, X5]

What is the equivalent undirected model?
[Figure: the corresponding undirected model over X1, …, X5]
GM Special Case: Time Series Model
A time series model:
A family of distributions over a sequence of random variables X1, X2, … that is indexed by a totally ordered indexing set (often referred to as time).

…, Xt-2, Xt-1 (past states), Xt (current state), Xt+1, Xt+2, … (future states)
Many applications:
• Financial/Economic data over time
• Climate data
• Speech and natural language
• …
Markov Models
Markov Model:
A time series model with the property:
The conditional distribution of the next state Xt+1 given all the previous states Xi (i ≤ t) depends only on the current state Xt:
P(Xt+1 | X1, …, Xt) = P(Xt+1 | Xt)

The corresponding graphical model (also known as a Markov chain):

… → Xt-2 → Xt-1 → Xt → Xt+1 → Xt+2 → …
Markov Chains: Distributions
To specify a Markov chain, we need:
• the distribution of the initial state X1
• the conditional distribution of Xt+1 given Xt (often called the transition matrix)

(We will focus on a finite state space, say, d different states.)

Initial state distribution: π, with π_i = P(X1 = i)

Conditional distribution: can be summarized in a d x d matrix A, with A_ij = P(Xt+1 = j | Xt = i)
A is row stochastic (each row sums to 1)
Markov Chain: Example
State space: {1, 2}. Parameters: an initial distribution π and a transition matrix A.
What is the probability of seeing the random sequence: 2,2,2,1,1,2,2,1 ?
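The slide's actual values of π and A are not reproduced above, so the sketch below plugs in assumed values purely for illustration. The probability of a sequence is the initial probability of its first state times the product of the transition probabilities along the sequence:

import numpy as np

# Assumed parameters (for illustration only): state space {1, 2}
pi = np.array([0.5, 0.5])            # P(X1 = 1), P(X1 = 2)
A = np.array([[0.9, 0.1],            # P(X_{t+1} = j | X_t = 1)
              [0.3, 0.7]])           # P(X_{t+1} = j | X_t = 2)

def sequence_probability(seq, pi, A):
    """P(X1 = seq[0], ..., XT = seq[-1]) = pi[x1] * prod_t A[x_t, x_{t+1}]."""
    idx = [s - 1 for s in seq]                   # states 1,2 -> indices 0,1
    p = pi[idx[0]]
    for a, b in zip(idx[:-1], idx[1:]):
        p *= A[a, b]
    return p

print(sequence_probability([2, 2, 2, 1, 1, 2, 2, 1], pi, A))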
Markov Chain: Example – PageRank
Web graph: vertices – webpages, edges – links between webpages
[Figure: link structure for 500 webpages]

Question: how popular is a given webpage i?
Possible answer: proportional to the probability that a random walk on the web graph ends on page i (for some large t).
Markov Chain: Marginals
Let's calculate the marginal probabilities P(Xt = i), i.e., the row vector P(Xt) = π A^(t-1).
For the PageRank example, does this converge to a stable value for large t?
Markov Chain: Limiting Behavior
Question: does/can P(Xt) have a limiting behavior?

Equivalent to asking: does A^t approach a limiting matrix whose rows are all equal to some distribution q, i.e.,

         — q —
A^t  →   — q —
         — q —

(a sufficient condition: all entries of A are strictly positive)

q is unique whenever the eigenvalue 1 of A has multiplicity one.

For such an A, q must satisfy: qA = q.
Equivalently: q is the left eigenvector of A with eigenvalue 1!
such a q is called the stationary distribution of A
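A short sketch (with an assumed 2-state transition matrix) of computing q numerically, both as the left eigenvector of A with eigenvalue 1 and as the limit of the marginals P(Xt):

import numpy as np

# Assumed row-stochastic transition matrix (for illustration)
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# q is a left eigenvector of A with eigenvalue 1: left eigenvectors of A
# are right eigenvectors of A transposed.
eigvals, eigvecs = np.linalg.eig(A.T)
i = np.argmin(np.abs(eigvals - 1.0))    # pick the eigenvalue closest to 1
q = np.real(eigvecs[:, i])
q = q / q.sum()                         # normalize to a probability vector
print("stationary distribution q =", q)
print("check: q A =", q @ A)            # should reproduce q

# The same q appears as the limit of the marginals p_t = p_1 A^(t-1)
p = np.array([1.0, 0.0])
for _ in range(50):
    p = p @ A
print("p after 50 steps =", p)

For this A the stationary distribution is (0.75, 0.25), and the iterated marginals approach the same vector.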
PageRank Example
The web graph doesn't have a unique stationary distribution in general, but we can add some regularity to the link matrix A, that is, Ã = A + ε1 (a small positive boost to every entry, making the matrix strictly positive).

[Figure: link structure for 500 webpages, together with its regularized stationary distribution]

Popularity of a given webpage i is proportional to the ith component of the (regularized) stationary distribution.
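A hedged sketch of the PageRank computation on a made-up four-page web graph: build the random-surfer transition matrix from the links, regularize it by adding a small ε to every entry (re-normalizing the rows so the matrix stays row stochastic), then run power iteration:

import numpy as np

# Hypothetical tiny web graph: link[i, j] = 1 if page i links to page j
link = np.array([[0, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 0, 0, 0],
                 [0, 0, 1, 0]], dtype=float)

# Row-normalize to get the random-surfer transition matrix A
A = link / link.sum(axis=1, keepdims=True)

# Regularize: add a small epsilon to every entry, then re-normalize the rows
eps = 0.01
A_tilde = A + eps
A_tilde = A_tilde / A_tilde.sum(axis=1, keepdims=True)

# Power iteration: q A_tilde^t converges to the stationary distribution
q = np.full(4, 1.0 / 4)
for _ in range(200):
    q = q @ A_tilde
print("PageRank scores (popularity of each page):", q)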
Markov Models with Unobserved Variable
Hidden Markov Model (HMM): a Markov chain on the pairs {(Xt, Yt)}t.

Some properties:
• Yt is an unobserved / hidden variable; only Xt is observed.
• Conditioned on Yt, Xt is independent of all other variables!

The corresponding graphical model:

… → Yt-2 → Yt-1 → Yt → Yt+1 → Yt+2 → …   (hidden chain)
with each hidden state Yt emitting the observation Xt (Yt → Xt)
Hidden Markov Models (HMMs) Applications
Natural Language Processing
Observed: words in a sentence
Unobserved: words’ part-of-speech or other word semantics
Bioinformatics
Observed: Amino acids in a protein
Unobserved: indicators of evolutionary conservation
Speech Recognition
Observed: Recorded speech
Unobserved: The phonemes the speaker intended to vocalize
HMM Parameters
We will focus on a discrete state space:
Xt takes values in { 1, …, D } (observed)
Yt takes values in { 1, …, K } (hidden)

We need:
• the initial state distribution π on Y1
• a K x K transition matrix A from Yt to Yt+1
• a K x D emission matrix B from Yt to Xt
Both A and B are row stochastic.
HMM: Example – Dishonest Casino
Casino die-rolling game:
Randomly switch between two possible dice: one is fair and one is loaded.

HMM parameters:
π = (1, 0) [the casino starts off with the fair die]
Problem: based on the sequence of rolls, guess which die was used at each time
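A simulation sketch of the casino, with assumed transition and emission probabilities (the exact values used in lecture are not reproduced above):

import numpy as np

rng = np.random.default_rng(0)

# Assumed parameters for illustration
pi = np.array([1.0, 0.0])                        # start with the fair die (state 0)
A = np.array([[0.95, 0.05],                      # fair   -> fair / loaded
              [0.10, 0.90]])                     # loaded -> fair / loaded
B = np.array([[1/6] * 6,                         # fair die: uniform over faces 1..6
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])   # loaded die: favors a six

def simulate(T):
    """Sample hidden die choices Y_1..Y_T and observed rolls X_1..X_T."""
    ys, xs = [], []
    y = rng.choice(2, p=pi)
    for _ in range(T):
        ys.append(y)
        xs.append(rng.choice(6, p=B[y]) + 1)     # faces reported as 1..6
        y = rng.choice(2, p=A[y])
    return np.array(ys), np.array(xs)

ys, xs = simulate(20)
print("hidden dice :", ys)                       # 0 = fair, 1 = loaded
print("observed    :", xs)

Guessing the hidden row from the observed row is exactly the decoding problem discussed next.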
HMM Learning and Inference Problems
Conditional Probabilities (filtering/smoothing)
• Given: parameters θ = (π, A, B), and the observation X1:T
• Goal: What is the conditional probability of Y1:T ?

Most probable sequence (decoding)
• Given: parameters θ = (π, A, B), and the observation X1:T
• Goal: Find the most probable hidden sequence Y1:T

Parameter Estimation
• Given: The observations X1:T
• Goal: Find the best parameter estimate of θ
HMM: Example – Dishonest Casino
HMM: Computing the Posterior Probabilities
Filtering Problem
One can compute this directly in the standard (brute-force) way, but that is slow and doesn't exploit the conditional independence structure of HMMs.

A popular fast algorithm:
the Forward-Backward algorithm, which can be done in two passes (one forward pass, one backward pass) over the states.

Decoding Problem
The most likely posterior setting of the hidden states can be computed efficiently using a dynamic programming algorithm called the Viterbi decoding algorithm.
See supplementary material for detail on these algorithms
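As a rough illustration of the filtering/smoothing computation, here is a compact forward-backward sketch for a discrete HMM (with per-step normalization for numerical stability), run on the dishonest-casino parameters assumed earlier; it returns P(Yt | X1:T) for every t:

import numpy as np

# Assumed dishonest-casino parameters (same illustrative values as before)
pi = np.array([1.0, 0.0])
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[1/6] * 6,
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def forward_backward(x, pi, A, B):
    """Return gamma[t, k] = P(Y_t = k | X_{1:T} = x), using scaled alpha/beta."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)

    # Forward pass: alpha[t] is proportional to P(Y_t, x_{1:t}); c[t] rescales each step
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Backward pass, reusing the same scaling constants
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]

    return alpha * beta    # smoothed posteriors gamma

rolls = np.array([6, 6, 3, 6, 6, 1, 2, 6, 6, 6]) - 1   # faces 1..6 -> indices 0..5
gamma = forward_backward(rolls, pi, A, B)
print("P(loaded die at each time):", np.round(gamma[:, 1], 3))

Viterbi decoding has the same two-pass flavor, but replaces the sums in the forward recursion with maximizations (plus a back-pointer pass to recover the most likely hidden sequence).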
HMM: Learning the Parameters
We can use the Expectation Maximization (EM) algorithm!
Input: n observation sequences

Initialize:
Start with an initial setting / guess of the parameters.

E-step:
Compute the conditional expectation of Y given X and the current parameter guess
(this can be done using the Forward-Backward algorithm).

M-step:
Given the estimate of Y and the observations X, we have the complete likelihood, so simply maximize the likelihood by taking the derivative and examining the stationary points.
See supplementary material for details
What We Learned…
• Graphical Models
Bayesian Networks and Markov Random Fields
• Doing inference and learning on graphical models
• Markov Models
• Hidden Markov Models
Questions?