CSE 3521: Bayesian Networks
(DAG Probabilistic Graphical Models)
[Many slides are adapted from previous CSE 5521 course at OSU.]
Probabilistic models for classification
Goal: output the class with the largest posterior probability
Generative models
Bayes’ rule: P(y|x) = P(x|y) P(y) / P(x)
Classification rule: ŷ = argmax_y P(y|x) = argmax_y P(x|y) P(y)
Naïve Bayes: P(x|y) = ∏_d P(x_d|y)
Conditional independence “assumption”: the features x_1, …, x_D are independent given the class y
Learning: use MLE, optimizing each term P(y) and P(x_d|y) with the corresponding data
Inference: compute the posterior P(y|x) for a new input x and output the most probable class
Probabilistic graphical models (PGMs)
An efficient way to encode “conditional independence”
From PGMs, we can decompose a joint probability much more efficiently
Probabilistic Inference on PGMs
Independence in PGMs
Problem: Dependent feature variables
Most real-world data have high-dimensional and correlated variables
Pixels in an image
Words in a document (“I”, “go”, “to”, “school”, “.”)
Genes in a microarray
Sometimes, even data instances are not independent
Need exponentially many parameters for the probabilistic model
Need more training data
Need more training and inference time
Question: How to build probability models?
How to compactly represent the joint P(x_1, …, x_N | θ), where θ are the parameters?
How can we learn the parameters with a reasonable amount of data?
How can we use this distribution to infer a set of variables given another?
Ex. Given the first 100 pixels, infer the rest
Ex. Given today’s stock market, predict tomorrow’s
The chain rule of probability
Can represent any joint distribution this way
Using any ordering of the variables…
We will use or interchangeably sometimes in this lecture!
Problem: this distribution has O(2^N) parameters if each variable is a binary random variable
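To make the blow-up concrete, here is a tiny counting sketch (not from the slides, just an illustration) of the free parameters needed to store a full joint over N binary variables as an explicit table.

def full_joint_num_params(n_vars: int) -> int:
    # A table over N binary variables has 2**N entries that must sum to 1,
    # so 2**N - 1 of them are free parameters.
    return 2 ** n_vars - 1

for n in (5, 10, 20, 30):
    print(n, full_joint_num_params(n))   # 30 variables already need over a billion parameters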
Conditional Independence
This is the key to representing large joint distributions
Definition: X and Y are conditionally independent given Z if and only if X ⊥ Y | Z ⟺ P(X, Y | Z) = P(X | Z) P(Y | Z)
i.e., the conditional joint distribution can be written as a product of the conditional marginal distributions
Equivalent properties: P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z)
Example: Markov models
“The future is independent of the past given the present”: x_{t+1} ⊥ x_{1:t-1} | x_t
Only O(N) parameters: P(x_{1:N}) = P(x_1) ∏_{i=2}^{N} P(x_i | x_{i-1})
Fewer parameters: (1) faster learning, (2) faster inference, and (3) less training data required!
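As a hedged illustration of the O(N) factorization above, the sketch below evaluates the probability of a binary sequence under a first-order Markov chain. The initial and transition probabilities are made up for the example, and numpy is assumed to be available.

import numpy as np

p_x1 = np.array([0.6, 0.4])          # P(x_1 = 0), P(x_1 = 1)
p_trans = np.array([[0.7, 0.3],      # row: value of x_{i-1}; column: value of x_i
                    [0.2, 0.8]])

def markov_joint(seq):
    # P(x_1:N) = P(x_1) * prod_i P(x_i | x_{i-1}); O(N) work, shared transition parameters.
    p = p_x1[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= p_trans[prev, cur]
    return p

print(markov_joint([0, 0, 1, 1, 1]))  # 0.6 * 0.7 * 0.3 * 0.8 * 0.8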
Probabilistic graphical models
First order Markov assumption is useful for 1-D sequence data
Sequences of words in a sentence or document
Question: What about 2-D images or 3-D videos?
Or, in general, arbitrary collections of variables?
Gene pathways, etc…
Probabilistic graphical models
A way to represent a joint distribution by making conditional independence assumptions
Nodes represent variables
Edges (node relationship!): can be directed or undirected
directed acyclic graph (DAG): Bayesian networks
Missing edges encode the conditional independence assumptions
Ex: (top) C and B are conditionally independent given A
Ex: think of the graph as a data generation process
Better name: “conditional independence diagrams”
Graph terminology
Graph (V,E) consists of
A set of nodes or vertices V={1, …, V}
A set of edges E = {(s, t) : s, t ∈ V}
Children/Parents (for directed graphs)
Ancestors (for directed graphs)
Descendants (for directed graphs)
Neighbors (for any graphs)
Cycle (for any graphs)
Tree (no cycles)
Clique / Maximal Clique (for undirected graphs): a complete subgraph; a maximal clique cannot be enlarged by adding another node
Directed graphical models
Graphical Model whose graph is a DAG
DAG: Directed acyclic graph (no cycles!)
A.K.A. Bayesian Networks
Also known as Bayes networks, belief networks
Nothing inherently Bayesian about them
Just a way of defining conditional independence
Directed graphical models
Key properties: Nodes can be ordered so that parents come before children
Topological ordering
Can be constructed from any DAG
Ordered Markov Property: t ⊥ pred(t) \ pa(t) | pa(t), where pred(t) are the node’s predecessors in a topological ordering
Generalization of the first-order Markov property to general DAGs
A node only depends on its parents (not on its other ancestors)
Ex: D ⊥ A | B, C
Decomposition (nodes are topologically ordered): P(x_{1:N}) = ∏_i P(x_i | pa(x_i))
Given the decomposition, you can draw the DAG
Given the DAG, you can derive the decomposition (a small sketch follows below)
Complete DAG: encodes no conditional independence assumptions
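The snippet below is a small sketch of reading the factorization off a DAG given as parent sets. The five-node graph is assumed to match the example decomposition that appears later in these slides; the function name is my own.

parents = {
    "x1": [],
    "x2": ["x1"],
    "x3": ["x1"],
    "x4": ["x2", "x3"],
    "x5": ["x3"],
}

def factorization(parents):
    # One factor P(node | its parents) per node, listed in topological order
    # (the dict above is already topologically ordered).
    factors = []
    for node, pa in parents.items():
        factors.append(f"P({node})" if not pa else f"P({node}|{','.join(pa)})")
    return " ".join(factors)

print(factorization(parents))   # P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)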
Naïve Bayes
White nodes: unknown
Gray nodes: observed
Can generate data!
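To illustrate “can generate data”, here is a hedged sketch of ancestral sampling from a Naïve Bayes model: sample the class from its prior, then each feature independently given the class. All numbers, class names, and feature counts are invented for the example.

import random

p_y = {"spam": 0.3, "ham": 0.7}           # class prior P(y)
p_x_given_y = {                           # Bernoulli parameters P(x_d = 1 | y)
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.1, 0.2, 0.5],
}

def sample_instance():
    # Ancestral sampling: parents first (here, y), then children (the features x_d).
    y = random.choices(list(p_y), weights=list(p_y.values()))[0]
    x = [int(random.random() < p) for p in p_x_given_y[y]]
    return y, x

print(sample_instance())   # e.g. ('ham', [0, 0, 1])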
Markov models
First order Markov Model
Second order Markov Model
Hidden Markov Model
White nodes: unknown
Gray nodes: observed
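A hidden Markov model can be sampled ancestrally in the same way: hidden states follow a first-order chain and each observation depends only on its state. The two-state parameters below are invented for the sketch, and numpy is assumed.

import numpy as np

rng = np.random.default_rng(0)
p_z1 = np.array([0.5, 0.5])                # initial hidden-state distribution P(z_1)
p_z = np.array([[0.9, 0.1], [0.2, 0.8]])   # transitions P(z_i | z_{i-1})
p_x = np.array([[0.7, 0.3], [0.1, 0.9]])   # emissions P(x_i | z_i) for binary x

def sample_hmm(n):
    z = rng.choice(2, p=p_z1)
    zs, xs = [z], [rng.choice(2, p=p_x[z])]
    for _ in range(n - 1):
        z = rng.choice(2, p=p_z[z])        # hidden state: white (unknown) node
        zs.append(z)
        xs.append(rng.choice(2, p=p_x[z])) # observation: gray (observed) node
    return zs, xs

print(sample_hmm(5))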
Example: medical diagnosis (The alarm network)
Another medical diagnosis example: QMR network
QMR: Quick Medical Reference
Probabilistic graphical models (PGMs)
An efficient way to encode conditional independence
From PGMs, we can decompose a joint probability much more efficiently
Probabilistic Inference on PGMs
Independence in PGMs
Probabilistic inference
Graphical Models provide a compact way to represent complex joint distributions
Question: Given a joint distribution, what can we do with it?
Answer: Main use = probabilistic inference
Estimate unknown variables from known ones
Ex. Given P(X, Y), predict the most likely assignment of Y given X=x
General form of inference
A correlated set of random variables x_1, …, x_V
Joint distribution: P(x_{1:V} | θ)
Assumption: the parameters θ are known
Partition the variables into:
Visible x_v (with assignments/observations)
Hidden x_h (unknown)
Goal: compute the unknowns from the known ones
General form of inference
Condition on the data by clamping the visible variables to their observed values
Normalize by the probability of the evidence: P(x_h | x_v, θ) = P(x_h, x_v | θ) / P(x_v | θ)
Nuisance variables
Partition the hidden variables into:
Query variables x_q (the ones we are interested in)
Nuisance variables x_n (not of interest); marginalize them out: P(x_q | x_v, θ) = Σ_{x_n} P(x_q, x_n | x_v, θ) (see the sketch below)
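The hedged sketch below runs this general recipe on an explicit joint table over three binary variables: clamp the visible variables, sum out the nuisance variables, and normalize by the probability of the evidence. The variable names A, B, C and the table entries are invented for the example.

import itertools

names = ("A", "B", "C")
probs = [0.12, 0.08, 0.20, 0.10, 0.05, 0.15, 0.18, 0.12]   # sums to 1
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))

def infer(query, evidence):
    # Returns P(query | evidence), marginalizing out every other (nuisance) variable.
    qi = names.index(query)
    scores = {0: 0.0, 1: 0.0}
    for assign, p in joint.items():
        if all(assign[names.index(v)] == val for v, val in evidence.items()):
            scores[assign[qi]] += p          # clamp the evidence, sum out the rest
    z = sum(scores.values())                 # probability of the evidence
    return {v: s / z for v, s in scores.items()}

print(infer("A", {"C": 1}))   # posterior P(A | C = 1)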
Inference on PGMs
B E A P(A|B,E)
+b +e +a 0.95
+b +e -a 0.05
+b -e +a 0.94
+b -e -a 0.06
-b +e +a 0.29
-b +e -a 0.71
-b -e +a 0.001
-b -e -a 0.999
A J P(J|A)
-a +j 0.05
-a -j 0.95
A M P(M|A)
-a +m 0.01
-a -m 0.99
[Figure: the burglary-alarm network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
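Here is a hedged sketch of inference by enumeration on the burglary-alarm network above, computing P(Burglary | +j, +m). The slide only shows part of the CPTs; the priors P(B), P(E) and the +a rows of P(J|A), P(M|A) below are assumed from the standard textbook version of this network, not taken from the slide.

from itertools import product

P_B = {True: 0.001, False: 0.999}                       # assumed prior P(B)
P_E = {True: 0.002, False: 0.998}                       # assumed prior P(E)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}      # P(+a | b, e) from the table
P_J = {True: 0.90, False: 0.05}                         # P(+j | a); the +a value is assumed
P_M = {True: 0.70, False: 0.01}                         # P(+m | a); the +a value is assumed

def joint(b, e, a, j, m):
    # Joint = product of each node's CPT given its parents (the DAG decomposition).
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(B | +j, +m): sum out the nuisance variables E and A, then normalize.
num = {b: sum(joint(b, e, a, True, True) for e, a in product([True, False], repeat=2))
       for b in (True, False)}
z = sum(num.values())
print({b: p / z for b, p in num.items()})               # roughly {True: 0.28, False: 0.72}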
** Inference vs. learning
Inference:
Parameters are assumed to be known
Learning (parameter estimation):
Compute MLE or MAP estimate of the parameters
** Bayesian learning
Parameters are treated as random variables
Update the posterior over the parameters by Bayes’ rule: P(θ | D) ∝ P(D | θ) P(θ)
No explicit distinction between inference and learning
Main distinction between inference and learning:
# of hidden variables grows with the size of the dataset
# of parameters is fixed
Probabilistic graphical models (PGMs)
An efficient way to encode conditional independence
From PGMs, we can decompose a joint probability much more efficiently
Probabilistic Inference on PGMs
Independence in PGMs
Conditional independence properties
X_A ⊥_G X_B | X_C: A is independent of B given C (as asserted by the graph G)
I(G) is the set of all such conditional independence assumptions encoded by G
G is an I-map for G’ iff I(G) ⊆ I(G’)
Where I(G’) is the set of all conditional independence statements that hold for G’
In other words: G doesn’t make any assertions that are not true about G’
Note: fully connected graph is an I-map for all distributions
G is a minimal I-map of G’ if:
G is an I-map of G’
There is no proper subgraph G’’ ⊂ G which is an I-map of G’
Conditional independence properties
How to determine if X_A ⊥ X_B | X_C from a graph G?
Easy for undirected graphs
Kind of complicated for DAGs (Bayesian Nets)
P(A,B,C) = P(A)P(C|A)P(B|C)
D-separation
Definitions:
An undirected path P is d-separated (blocked) iff at least one of the following conditions holds:
P contains a chain S -> M -> T or S <- M <- T, where M is given/observed (with evidence)
P contains a fork S <- M -> T, where M is given/observed (with evidence)
P contains a v-structure S -> M <- T, where M and its descendants are not given/observed (no evidence)
Path cases
D-separation
A set of nodes A is D-separated from a set of nodes B given a third set of nodes E, iff each undirected path from every node in A to every node in B is D-separated given nodes in E
Finally, define the conditional independence (CI) properties of a DAG as follows:
Example: conditioned on M1 being observed (with evidence) and M2 being unobserved (without evidence), {S1, S2} is independent of {T1, T2}
Bayes ball algorithm
Simple way to check if A is d-separated from B given E
Shade in all nodes in E
Place “balls” in each node in A and let them “bounce around” according to some rules
Note: balls can travel in either direction
Check if any balls from A reach nodes in B
If there exists one ball that can reach from A to B, then A and B are not d-separated given E
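The code below is a hedged sketch of the d-separation test described above, implemented by enumerating undirected paths and applying the three blocking rules; the Bayes ball procedure gives the same answers more efficiently. The DAG encoding (node to list of children) and the helper names are my own.

def descendants(g, node):
    # All nodes reachable from node by following directed edges.
    out, stack = set(), [node]
    while stack:
        for child in g.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def d_separated(g, a, b, evidence):
    # g maps each node to the list of its children; evidence is a set of observed nodes.
    parents = {n: [p for p in g if n in g[p]] for n in g}
    def paths(node, target, visited):
        # All simple undirected paths from node to target.
        if node == target:
            yield [node]
            return
        for nxt in list(g[node]) + parents[node]:
            if nxt not in visited:
                for rest in paths(nxt, target, visited | {nxt}):
                    yield [node] + rest
    def blocked(path):
        for s, m, t in zip(path, path[1:], path[2:]):
            if m in g[s] and m in g[t]:                  # v-structure s -> m <- t
                if m not in evidence and not (descendants(g, m) & evidence):
                    return True                          # unobserved collider blocks
            elif m in evidence:                          # chain or fork with observed middle
                return True
        return False
    return all(blocked(p) for p in paths(a, b, {a}))

# The burglary-alarm network from the earlier slides:
g = {"B": ["A"], "E": ["A"], "A": ["J", "M"], "J": [], "M": []}
print(d_separated(g, "B", "E", set()))      # True: B and E are marginally independent
print(d_separated(g, "B", "E", {"A"}))      # False: explaining away
print(d_separated(g, "J", "M", {"A"}))      # True: given the alarm, the two calls are independent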
Bayes ball rules
Explaining away (inter-causal reasoning)
Example: Toss two coins and observe their sum
Are Gas and Radio independent? Given Battery? Ignition? Starts? Moves?
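A tiny numeric check of explaining away with the two-coin example from the slide: X and Z are independent fair coins and Y = X + Z. Marginally X tells us nothing about Z, but once the sum Y is observed, learning Z changes our belief about X. The function name below is my own.

from itertools import product

joint = {(x, z): 0.25 for x, z in product([0, 1], repeat=2)}   # two independent fair coins

def p_x1_given(y, z=None):
    # P(X = 1 | Y = y [, Z = z]) by direct enumeration of the joint.
    num = den = 0.0
    for (x, zz), p in joint.items():
        if x + zz == y and (z is None or zz == z):
            den += p
            num += p * (x == 1)
    return num / den

print(p_x1_given(y=1))         # 0.5: the sum alone leaves X uncertain
print(p_x1_given(y=1, z=1))    # 0.0: observing Z = 1 "explains away" the sum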
Other Independence properties
1. Ordered Markov property: t ⊥ pred(t) \ pa(t) | pa(t)
2. Directed local Markov property: t ⊥ nd(t) \ pa(t) | pa(t)
3. D-separation (we saw this already)
Less obvious: 1 ⟹ 2 ⟹ 3
Easy to see: 3 ⟹ 2 ⟹ 1 (so the three properties are equivalent)
nd(t) = “non-descendants” of t
Markov blanket (observed)
Definition:
The smallest set of “observed” nodes that renders a node t conditionally independent of all the other nodes in the graph.
The Markov blanket in a DAG is: parents + children + co-parents
Co-parents: other nodes that are also parents of the node’s children
Markov blanket (observed)
Each node is conditionally independent of all others given its Markov blanket:
parents + children + children’s parents
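As a short sketch, the function below collects the Markov blanket of a node (parents, children, and co-parents) from a DAG given as a child-list, and applies it to the burglary-alarm network from the earlier slides. The encoding and function name are my own.

def markov_blanket(g, node):
    # g maps each node to the list of its children.
    parents = {n for n in g if node in g[n]}
    children = set(g[node])
    co_parents = {n for n in g for c in g[n] if c in children} - {node}
    return parents | children | co_parents

g = {"B": ["A"], "E": ["A"], "A": ["J", "M"], "J": [], "M": []}
print(markov_blanket(g, "A"))   # {'B', 'E', 'J', 'M'}: everything else in this graph
print(markov_blanket(g, "J"))   # {'A'}: just its parent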
Summary of the equations on the slides above:
Conditional independence: X ⊥ Y | Z ⟺ P(X, Y | Z) = P(X | Z) P(Y | Z)
First-order Markov property: x_{t+1} ⊥ x_{1:t-1} | x_t
Chain rule: P(x_1, x_2, …, x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) … P(x_N | x_1, …, x_{N-1})
First-order Markov chain: P(x_{1:N}) = P(x_1) ∏_{i=2}^{N} P(x_i | x_{i-1})
Example DAG decomposition (five-node DAG): P(x_{1:5}) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) P(x_4 | x_1, x_2, x_3) P(x_5 | x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2, x_3) P(x_5 | x_3)
Naïve Bayes: P(y, x_{1:D}) = P(y) ∏_{d=1}^{D} P(x_d | y)
Second-order Markov model: P(x_{1:N}) = P(x_1, x_2) ∏_{i=3}^{N} P(x_i | x_{i-1}, x_{i-2})
Hidden Markov model: P(x_{1:N}, z_{1:N}) = P(z_1) P(x_1 | z_1) ∏_{i=2}^{N} P(z_i | z_{i-1}) P(x_i | z_i)
[Diagram node labels from the medical examples: Intubation, Disconnect, Vent Tube (alarm network); v_1 … v_5 (QMR network)]
Inference setup: joint P(x_{1:V} | θ); posterior over the hidden variables P(x_h | x_v, θ)
D-separation: X_A ⊥_G X_B | X_E ⟺ A is d-separated from B given E
Explaining away: P(x, z | y) = P(x) P(z) P(y | x, z) / P(y); P(x, z) = P(x) P(z), but x and z are not independent given y
Directed local Markov property: t ⊥ nd(t) \ pa(t) | pa(t)
Ordered Markov property: t ⊥ pred(t) \ pa(t) | pa(t)
Equivalence of the Markov properties: 1 ⟹ 2 ⟹ 3 and 3 ⟹ 2 ⟹ 1