CSE 3521: Bayesian Networks
(DAG Probabilistic Graphical Models)
[Many slides are adapted from previous CSE 5521 course at OSU.]
Probabilistic models for classification
Goal: output the class with the largest posterior probability
Generative models
Bayes’ rule: P(y|x) = P(x|y) P(y) / P(x)
Classification rule: ŷ = argmax_y P(y|x) = argmax_y P(x|y) P(y)
Naïve Bayes: P(x|y) = ∏_d P(x_d|y)
Conditional independence “assumption”: the features x_1, …, x_D are independent given the class y
Learning: use MLE, optimizing each term P(y) and P(x_d|y) with the corresponding data
Inference: compute the posterior P(y|x) for a new input x and output the most probable class
Probabilistic graphical models (PGMs)
An efficient way to encode “conditional independence”
From PGMs, we can decompose a joint probability much more efficiently
Probabilistic Inference on PGMs
Independence in PGMs
Problem: Dependent feature variables
Most real-world data have high-dimensional and correlated variables
Pixels in an image
Words in a document (“I”, “go”, “to”, “school”, “.”)
Genes in a microarray
Sometimes, even data instances are not independent
Need exponentially many parameters for the probabilistic model
Need more training data
Need more training and inference time
Question: How to build probability models?
How to compactly represent the joint P(x_1, …, x_N | θ), where θ are the parameters?
How can we learn the parameters with a reasonable amount of data?
How can we use this distribution to infer a set of variables given another?
Ex. Given the first 100 pixels, infer the rest
Ex. Given today’s stock market, predict tomorrow’s
The chain rule of probability
Can represent any joint distribution this way
Using any ordering of the variables…
We will use or interchangeably sometimes in this lecture!
Problem: this distribution has O(2^N) parameters if each variable is a binary random variable
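To make the blow-up concrete, here is a tiny counting sketch (not from the slides, just an illustration) of the free parameters needed to store a full joint over N binary variables as an explicit table.

def full_joint_num_params(n_vars: int) -> int:
    # A table over N binary variables has 2**N entries that must sum to 1,
    # so 2**N - 1 of them are free parameters.
    return 2 ** n_vars - 1

for n in (5, 10, 20, 30):
    print(n, full_joint_num_params(n))   # 30 variables already need over a billion parameters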
Conditional Independence
This is the key to representing large joint distributions
Definition: X and Y are conditionally independent given Z if and only if X ⊥ Y | Z ⟺ P(X, Y | Z) = P(X | Z) P(Y | Z)
i.e., the conditional joint distribution can be written as a product of the conditional marginal distributions
Equivalent properties: P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z)
Example: Markov models
“The future is independent of the past given the present”: x_{t+1} ⊥ x_{1:t-1} | x_t
Only O(N) parameters: P(x_{1:N}) = P(x_1) ∏_{i=2}^{N} P(x_i | x_{i-1})
Fewer parameters: (1) faster learning, (2) faster inference, and (3) less training data required!
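As a hedged illustration of the O(N) factorization above, the sketch below evaluates the probability of a binary sequence under a first-order Markov chain. The initial and transition probabilities are made up for the example, and numpy is assumed to be available.

import numpy as np

p_x1 = np.array([0.6, 0.4])          # P(x_1 = 0), P(x_1 = 1)
p_trans = np.array([[0.7, 0.3],      # row: value of x_{i-1}; column: value of x_i
                    [0.2, 0.8]])

def markov_joint(seq):
    # P(x_1:N) = P(x_1) * prod_i P(x_i | x_{i-1}); O(N) work, shared transition parameters.
    p = p_x1[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= p_trans[prev, cur]
    return p

print(markov_joint([0, 0, 1, 1, 1]))  # 0.6 * 0.7 * 0.3 * 0.8 * 0.8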
Probabilistic graphical models
First order Markov assumption is useful for 1-D sequence data
Sequences of words in a sentence or document
Question: What about 2-D images or 3-D videos?
Or, in general, arbitrary collections of variables?
Gene pathways, etc…
Probabilistic graphical models
A way to represent a joint distribution by making conditional independence assumptions
Nodes represent variables
Edges (node relationship!): can be directed or undirected
directed acyclic graph (DAG): Bayesian networks
Missing edges encode the conditional independence assumptions
Ex: (top) C and B are conditionally independent given A
Ex: think of the graph as a data generation process
Better name: “conditional independence diagrams”
Graph terminology
Graph (V,E) consists of
A set of nodes or vertices V={1, …, V}
A set of edges E = {(s, t) : s, t ∈ V}
Children/Parents (for directed graphs)
Ancestors (for directed graphs)
Descendants (for directed graphs)
Neighbors (for any graphs)
Cycle (for any graphs)
Tree (no cycles)
Clique / Maximal Clique (for undirected graphs): a complete subgraph; a maximal clique cannot be enlarged by adding another node
Directed graphical models
Graphical Model whose graph is a DAG
DAG: Directed acyclic graph (no cycles!)
A.K.A. Bayesian Networks
Also known as Bayes networks, belief networks
Nothing inherently Bayesian about them
Just a way of defining conditional independence
Directed graphical models
Key properties: Nodes can be ordered so that parents come before children
Topological ordering
Can be constructed from any DAG
Ordered Markov Property: t ⊥ pred(t) \ pa(t) | pa(t), where pred(t) are the node’s predecessors in a topological ordering
Generalization of the first-order Markov property to general DAGs
A node only depends on its parents (not on its other ancestors)
Ex: D ⊥ A | B, C
Decomposition (nodes are topologically ordered): P(x_{1:N}) = ∏_i P(x_i | pa(x_i))
Given the decomposition, you can draw the DAG
Given the DAG, you can derive the decomposition (a small sketch follows below)
Complete DAG: encodes no conditional independence assumptions
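The snippet below is a small sketch of reading the factorization off a DAG given as parent sets. The five-node graph is assumed to match the example decomposition that appears later in these slides; the function name is my own.

parents = {
    "x1": [],
    "x2": ["x1"],
    "x3": ["x1"],
    "x4": ["x2", "x3"],
    "x5": ["x3"],
}

def factorization(parents):
    # One factor P(node | its parents) per node, listed in topological order
    # (the dict above is already topologically ordered).
    factors = []
    for node, pa in parents.items():
        factors.append(f"P({node})" if not pa else f"P({node}|{','.join(pa)})")
    return " ".join(factors)

print(factorization(parents))   # P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)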
Naïve Bayes
White nodes: unknown
Gray nodes: observed
Can generate data!
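To illustrate “can generate data”, here is a hedged sketch of ancestral sampling from a Naïve Bayes model: sample the class from its prior, then each feature independently given the class. All numbers, class names, and feature counts are invented for the example.

import random

p_y = {"spam": 0.3, "ham": 0.7}           # class prior P(y)
p_x_given_y = {                           # Bernoulli parameters P(x_d = 1 | y)
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.1, 0.2, 0.5],
}

def sample_instance():
    # Ancestral sampling: parents first (here, y), then children (the features x_d).
    y = random.choices(list(p_y), weights=list(p_y.values()))[0]
    x = [int(random.random() < p) for p in p_x_given_y[y]]
    return y, x

print(sample_instance())   # e.g. ('ham', [0, 0, 1])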
Markov models
First order Markov Model
Second order Markov Model
Hidden Markov Model
White nodes: unknown
Gray nodes: observed
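A hidden Markov model can be sampled ancestrally in the same way: hidden states follow a first-order chain and each observation depends only on its state. The two-state parameters below are invented for the sketch, and numpy is assumed.

import numpy as np

rng = np.random.default_rng(0)
p_z1 = np.array([0.5, 0.5])                # initial hidden-state distribution P(z_1)
p_z = np.array([[0.9, 0.1], [0.2, 0.8]])   # transitions P(z_i | z_{i-1})
p_x = np.array([[0.7, 0.3], [0.1, 0.9]])   # emissions P(x_i | z_i) for binary x

def sample_hmm(n):
    z = rng.choice(2, p=p_z1)
    zs, xs = [z], [rng.choice(2, p=p_x[z])]
    for _ in range(n - 1):
        z = rng.choice(2, p=p_z[z])        # hidden state: white (unknown) node
        zs.append(z)
        xs.append(rng.choice(2, p=p_x[z])) # observation: gray (observed) node
    return zs, xs

print(sample_hmm(5))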
Example: medical diagnosis (The alarm network)
Another medical diagnosis example: QMR network
QMR: Quick Medical Reference
Probabilistic graphical models (PGMs)
An efficient way to encode conditional independence
From PGMs, we can decompose a joint probability much more efficiently
Probabilistic Inference on PGMs
Independence in PGMs
Probabilistic inference
Graphical Models provide a compact way to represent complex joint distributions
Question: Given a joint distribution, what can we do with it?
Answer: Main use = probabilistic inference
Estimate unknown variables from known ones
Ex. Given P(X, Y), predict the most likely assignment of Y given X=x
General form of inference
A correlated set of random variables x_1, …, x_V
Joint distribution: P(x_{1:V} | θ)
Assumption: the parameters θ are known
Partition the variables into:
Visible x_v (with assignments/observations)
Hidden x_h (unknown)
Goal: compute the unknowns from the known ones
General form of inference
Condition on the data by clamping the visible variables to their observed values
Normalize by the probability of the evidence: P(x_h | x_v, θ) = P(x_h, x_v | θ) / P(x_v | θ)
Nuisance variables
Partition the hidden variables into:
Query variables x_q (the ones we are interested in)
Nuisance variables x_n (not of interest); marginalize them out: P(x_q | x_v, θ) = Σ_{x_n} P(x_q, x_n | x_v, θ) (see the sketch below)
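The hedged sketch below runs this general recipe on an explicit joint table over three binary variables: clamp the visible variables, sum out the nuisance variables, and normalize by the probability of the evidence. The variable names A, B, C and the table entries are invented for the example.

import itertools

names = ("A", "B", "C")
probs = [0.12, 0.08, 0.20, 0.10, 0.05, 0.15, 0.18, 0.12]   # sums to 1
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))

def infer(query, evidence):
    # Returns P(query | evidence), marginalizing out every other (nuisance) variable.
    qi = names.index(query)
    scores = {0: 0.0, 1: 0.0}
    for assign, p in joint.items():
        if all(assign[names.index(v)] == val for v, val in evidence.items()):
            scores[assign[qi]] += p          # clamp the evidence, sum out the rest
    z = sum(scores.values())                 # probability of the evidence
    return {v: s / z for v, s in scores.items()}

print(infer("A", {"C": 1}))   # posterior P(A | C = 1)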
Inference on PGMs
B E A P(A|B,E)
+b +e +a 0.95
+b +e -a 0.05
+b -e +a 0.94
+b -e -a 0.06
-b +e +a 0.29
-b +e -a 0.71
-b -e +a 0.001
-b -e -a 0.999
A J P(J|A)
-a +j 0.05
-a -j 0.95
A M P(M|A)
-a +m 0.01
-a -m 0.99
[Figure: the burglary-alarm network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
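Here is a hedged sketch of inference by enumeration on the burglary-alarm network above, computing P(Burglary | +j, +m). The slide only shows part of the CPTs; the priors P(B), P(E) and the +a rows of P(J|A), P(M|A) below are assumed from the standard textbook version of this network, not taken from the slide.

from itertools import product

P_B = {True: 0.001, False: 0.999}                       # assumed prior P(B)
P_E = {True: 0.002, False: 0.998}                       # assumed prior P(E)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}      # P(+a | b, e) from the table
P_J = {True: 0.90, False: 0.05}                         # P(+j | a); the +a value is assumed
P_M = {True: 0.70, False: 0.01}                         # P(+m | a); the +a value is assumed

def joint(b, e, a, j, m):
    # Joint = product of each node's CPT given its parents (the DAG decomposition).
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(B | +j, +m): sum out the nuisance variables E and A, then normalize.
num = {b: sum(joint(b, e, a, True, True) for e, a in product([True, False], repeat=2))
       for b in (True, False)}
z = sum(num.values())
print({b: p / z for b, p in num.items()})               # roughly {True: 0.28, False: 0.72}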
** Inference vs. learning
Inference:
Parameters are assumed to be known
Learning (parameter estimation):
Compute MLE or MAP estimate of the parameters
** Bayesian learning
Parameters are treated as random variables
Update the posterior over the parameters by Bayes’ rule: P(θ | D) ∝ P(D | θ) P(θ)
No explicit distinction between inference and learning
Main distinction between inference and learning:
# of hidden variables grows with the size of the dataset
# of parameters is fixed
Probabilistic graphical models (PGMs)
An efficient way to encode conditional independence
From PGMs, we can decompose a joint probability much more efficiently
Probabilistic Inference on PGMs
Independence in PGMs
Conditional independence properties
X_A ⊥_G X_B | X_C: A is independent of B given C (as asserted by the graph G)
I(G) is the set of all such conditional independence assumptions encoded by G
G is an I-map for G’ iff I(G) ⊆ I(G’)
Where I(G’) is the set of all conditional independence statements that hold for G’
In other words: G doesn’t make any assertions that are not true about G’
Note: fully connected graph is an I-map for all distributions
G is a minimal I-map of G’ if:
G is an I-map of G’
There is no proper subgraph G’’ ⊂ G which is an I-map of G’
Conditional independence properties
How to determine if X_A ⊥ X_B | X_C from a graph G?
Easy for undirected graphs
Kind of complicated for DAGs (Bayesian Nets)
P(A,B,C) = P(A)P(C|A)P(B|C)
D-separation
Definitions:
An undirected path P is d-separated (blocked) iff at least one of the following conditions holds:
P contains a chain S -> M -> T or S <- M <- T, where M is given/observed (with evidence)
P contains a fork S <- M -> T, where M is given/observed (with evidence)
P contains a v-structure S -> M <- T, where M and its descendants are not given/observed (no evidence)
Path cases
D-separation
A set of nodes A is D-separated from a set of nodes B given a third set of nodes E, iff each undirected path from every node in A to every node in B is D-separated given nodes in E
Finally, define the conditional independence (CI) properties of a DAG as follows:
Example: conditioned on M1 being observed (with evidence) and M2 being unobserved (without evidence), {S1, S2} is independent of {T1, T2}
Bayes ball algorithm
Simple way to check if A is d-separated from B given E
Shade in all nodes in E
Place “balls” in each node in A and let them “bounce around” according to some rules
Note: balls can travel in either direction
Check if any balls from A reach nodes in B
If there exists one ball that can reach from A to B, then A and B are not d-separated given E
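The code below is a hedged sketch of the d-separation test described above, implemented by enumerating undirected paths and applying the three blocking rules; the Bayes ball procedure gives the same answers more efficiently. The DAG encoding (node to list of children) and the helper names are my own.

def descendants(g, node):
    # All nodes reachable from node by following directed edges.
    out, stack = set(), [node]
    while stack:
        for child in g.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def d_separated(g, a, b, evidence):
    # g maps each node to the list of its children; evidence is a set of observed nodes.
    parents = {n: [p for p in g if n in g[p]] for n in g}
    def paths(node, target, visited):
        # All simple undirected paths from node to target.
        if node == target:
            yield [node]
            return
        for nxt in list(g[node]) + parents[node]:
            if nxt not in visited:
                for rest in paths(nxt, target, visited | {nxt}):
                    yield [node] + rest
    def blocked(path):
        for s, m, t in zip(path, path[1:], path[2:]):
            if m in g[s] and m in g[t]:                  # v-structure s -> m <- t
                if m not in evidence and not (descendants(g, m) & evidence):
                    return True                          # unobserved collider blocks
            elif m in evidence:                          # chain or fork with observed middle
                return True
        return False
    return all(blocked(p) for p in paths(a, b, {a}))

# The burglary-alarm network from the earlier slides:
g = {"B": ["A"], "E": ["A"], "A": ["J", "M"], "J": [], "M": []}
print(d_separated(g, "B", "E", set()))      # True: B and E are marginally independent
print(d_separated(g, "B", "E", {"A"}))      # False: explaining away
print(d_separated(g, "J", "M", {"A"}))      # True: given the alarm, the two calls are independent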
Bayes ball rules
Explaining away (inter-causal reasoning)
Example: Toss two coins and observe their sum
Are Gas and Radio independent? Given Battery? Ignition? Starts? Moves?
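A tiny numeric check of explaining away with the two-coin example from the slide: X and Z are independent fair coins and Y = X + Z. Marginally X tells us nothing about Z, but once the sum Y is observed, learning Z changes our belief about X. The function name below is my own.

from itertools import product

joint = {(x, z): 0.25 for x, z in product([0, 1], repeat=2)}   # two independent fair coins

def p_x1_given(y, z=None):
    # P(X = 1 | Y = y [, Z = z]) by direct enumeration of the joint.
    num = den = 0.0
    for (x, zz), p in joint.items():
        if x + zz == y and (z is None or zz == z):
            den += p
            num += p * (x == 1)
    return num / den

print(p_x1_given(y=1))         # 0.5: the sum alone leaves X uncertain
print(p_x1_given(y=1, z=1))    # 0.0: observing Z = 1 "explains away" the sum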
Other Independence properties
1. Ordered Markov property: t ⊥ pred(t) \ pa(t) | pa(t)
2. Directed local Markov property: t ⊥ nd(t) \ pa(t) | pa(t)
3. D-separation (we saw this already)
Less obvious: 1 ⟹ 2 ⟹ 3
Easy to see: 3 ⟹ 2 ⟹ 1 (so the three properties are equivalent)
nd(t) = “non-descendants” of t
Markov blanket (observed)
Definition:
The smallest set of “observed” nodes that renders a node t conditionally independent of all the other nodes in the graph.
The Markov blanket in a DAG is: parents + children + co-parents
Co-parents: other nodes that are also parents of the node’s children
Markov blanket (observed)
Each node is conditionally independent of all others given its Markov blanket:
parents + children + children’s parents
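As a short sketch, the function below collects the Markov blanket of a node (parents, children, and co-parents) from a DAG given as a child-list, and applies it to the burglary-alarm network from the earlier slides. The encoding and function name are my own.

def markov_blanket(g, node):
    # g maps each node to the list of its children.
    parents = {n for n in g if node in g[n]}
    children = set(g[node])
    co_parents = {n for n in g for c in g[n] if c in children} - {node}
    return parents | children | co_parents

g = {"B": ["A"], "E": ["A"], "A": ["J", "M"], "J": [], "M": []}
print(markov_blanket(g, "A"))   # {'B', 'E', 'J', 'M'}: everything else in this graph
print(markov_blanket(g, "J"))   # {'A'}: just its parent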
Summary of the equations on the slides above:
Conditional independence: X ⊥ Y | Z ⟺ P(X, Y | Z) = P(X | Z) P(Y | Z)
First-order Markov property: x_{t+1} ⊥ x_{1:t-1} | x_t
Chain rule: P(x_1, x_2, …, x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) … P(x_N | x_1, …, x_{N-1})
First-order Markov chain: P(x_{1:N}) = P(x_1) ∏_{i=2}^{N} P(x_i | x_{i-1})
Example DAG decomposition (five-node DAG): P(x_{1:5}) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) P(x_4 | x_1, x_2, x_3) P(x_5 | x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2, x_3) P(x_5 | x_3)
Naïve Bayes: P(y, x_{1:D}) = P(y) ∏_{d=1}^{D} P(x_d | y)
Second-order Markov model: P(x_{1:N}) = P(x_1, x_2) ∏_{i=3}^{N} P(x_i | x_{i-1}, x_{i-2})
Hidden Markov model: P(x_{1:N}, z_{1:N}) = P(z_1) P(x_1 | z_1) ∏_{i=2}^{N} P(z_i | z_{i-1}) P(x_i | z_i)
[Diagram node labels from the medical examples: Intubation, Disconnect, Vent Tube (alarm network); v_1 … v_5 (QMR network)]
Inference setup: joint P(x_{1:V} | θ); posterior over the hidden variables P(x_h | x_v, θ)
D-separation: X_A ⊥_G X_B | X_E ⟺ A is d-separated from B given E
Explaining away: P(x, z | y) = P(x) P(z) P(y | x, z) / P(y); P(x, z) = P(x) P(z), but x and z are not independent given y
Directed local Markov property: t ⊥ nd(t) \ pa(t) | pa(t)
Ordered Markov property: t ⊥ pred(t) \ pa(t) | pa(t)
Equivalence of the Markov properties: 1 ⟹ 2 ⟹ 3 and 3 ⟹ 2 ⟹ 1