Graphical Models
Bayesian Networks
Objectives
Describe Bayesian Networks
Illustrate key tasks in implementing Bayesian Networks
Why do we use graphical models?
|In machine learning, we are often concerned with joint distributions of many random variables.
|A graph may provide an intuitive way of representing or visualizing the relationships of the variables.
-Making it easier for domain experts to build a model
[Figure: example network with nodes Smoking history, Lung cancer, X-ray, Bronchitis, and Fatigue]
Graphical Models for Causal Relations
|Graphical models arise naturally from, often causal, independency relations of physical events.
[Figure: network over Smoking history, Lung cancer, X-ray, and Fatigue, with edges following the causal directions]
|Caveat: probabilistic relationship does not imply causality.
Bayesian Networks
|A BN is a directed acyclic graph (DAG), where
-Nodes (vertices) represent random variables.
-Directed edges represent direct dependence between nodes.
|Other names: Belief networks, Bayes nets, etc.
Conditional Independence
|E.g., given the chain X1 → X2 → X3, check the relationship between X3 and X1
-X3 is dependent on X2, and X2 is dependent on X1
-Thus X3 is dependent on X1
-But given X2, X3 is independent of X1
➔Conditional Independence: X3 ⊥ X1 | X2 (see the numerical check below)
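As a quick numerical check, the sketch below builds this chain with made-up conditional probability tables (all numbers are hypothetical) and verifies that P(X3 | X2, X1) no longer depends on X1 once X2 is given.

```python
# Hypothetical CPTs for the chain X1 -> X2 -> X3 (binary variables).
P_x1 = {0: 0.6, 1: 0.4}
P_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(X2 | X1)
P_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(X3 | X2)

def joint(x1, x2, x3):
    """P(X1, X2, X3) factorised along the chain."""
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x2[x2][x3]

# P(X3=1 | X2=1, X1=x1) for each value of X1: both equal P(X3=1 | X2=1).
for x1 in (0, 1):
    num = joint(x1, 1, 1)
    den = sum(joint(x1, 1, x3) for x3 in (0, 1))
    print(f"P(X3=1 | X2=1, X1={x1}) = {num / den:.3f}")
print(f"P(X3=1 | X2=1)        = {P_x3_given_x2[1][1]:.3f}")
```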
BN for General Conditional Dependency
|A BN can be used to model a given set of conditional dependencies
-For example, using the chain rule of probability, we have P(A,B,C,D,E)=P(A)P(B|A)P(C|A,B)P(D|A,B,C)P(E|A,B,C,D)
|If we know that, given A, C does not depend on B, and so forth, we may have
P(A,B,C,D,E)=P(A)P(B|A)P(C|A)P(D|B)P(E|B,D)
➢We could represent joint distributions more compactly in BN → Efficient computation
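To make the saving concrete, here is a minimal sketch with hypothetical binary variables and made-up CPT values; it evaluates the joint through the factorisation above, which needs only 11 free parameters rather than the 31 of a full joint table over five binary variables.

```python
# Hypothetical CPTs following P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B) P(E|B,D).
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}      # keyed by A
P_C_given_A = {1: {1: 0.6, 0: 0.4}, 0: {1: 0.3, 0: 0.7}}      # keyed by A
P_D_given_B = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.2, 0: 0.8}}      # keyed by B
P_E_given_BD = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.6, 0: 0.4},
                (0, 1): {1: 0.4, 0: 0.6}, (0, 0): {1: 0.05, 0: 0.95}}

def joint(a, b, c, d, e):
    """Joint probability via the BN factorisation
    (11 free parameters instead of the 31 of a full joint table)."""
    return (P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]
            * P_D_given_B[b][d] * P_E_given_BD[(b, d)][e])

print(joint(1, 1, 0, 1, 1))   # P(A=1, B=1, C=0, D=1, E=1)
```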
Inference in Bayesian Networks
|Given a model and some data (“evidence”), how to update our belief?
|What are the model parameters?
Inference in Bayesian Networks (cont’d)
|Given a model and some data (“evidence”), how to update our belief?
|E.g., for a patient with certain smoking history (non-smoker), whose X-ray result is positive, and who does not experience fatigue:
-What is the probability of having lung cancer?
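A minimal inference-by-enumeration sketch for a query of this kind, using a simplified fragment of the network (Smoking → Cancer, Cancer → X-ray, Cancer → Fatigue) with made-up probabilities; both the structure and the numbers are assumptions for illustration, not the lecture's actual model.

```python
# Simplified lung-cancer fragment with made-up probabilities (all binary).
P_smoker = {1: 0.3, 0: 0.7}
P_cancer_given_smoker = {1: {1: 0.10, 0: 0.90}, 0: {1: 0.01, 0: 0.99}}
P_xray_given_cancer = {1: {1: 0.90, 0: 0.10}, 0: {1: 0.05, 0: 0.95}}
P_fatigue_given_cancer = {1: {1: 0.70, 0: 0.30}, 0: {1: 0.20, 0: 0.80}}

def joint(smoker, cancer, xray, fatigue):
    return (P_smoker[smoker]
            * P_cancer_given_smoker[smoker][cancer]
            * P_xray_given_cancer[cancer][xray]
            * P_fatigue_given_cancer[cancer][fatigue])

# Evidence: non-smoker, positive X-ray, no fatigue.  Query: P(cancer | evidence).
evidence = dict(smoker=0, xray=1, fatigue=0)
num = joint(cancer=1, **evidence)
den = sum(joint(cancer=c, **evidence) for c in (0, 1))
print(f"P(cancer | evidence) = {num / den:.4f}")
```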
Inference in Bayesian Networks (cont’d)
|In a simple BN like this, we can compute the exact probabilities.
|In general, for a tree-structured BN, we may use belief propagation for the inference problem.
|For general structures, sometimes it is possible to generalize the above method (e.g., the junction tree algorithm). More often, we must resort to approximation methods
-E.g. Variational methods, Sampling (Monte Carlo) methods.
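For comparison, a sketch of a sampling-based (Monte Carlo) approximation of the same toy query via likelihood weighting; the CPT numbers are again made up and the helper name lw_estimate is hypothetical.

```python
import random

# Likelihood weighting on the toy lung-cancer fragment (made-up numbers):
# sample the non-evidence variable from its CPT, weight each sample by the
# probability of the fixed evidence (non-smoker, positive X-ray, no fatigue).
P_smoker = {1: 0.3, 0: 0.7}
P_cancer_given_smoker = {1: {1: 0.10, 0: 0.90}, 0: {1: 0.01, 0: 0.99}}
P_xray_given_cancer = {1: {1: 0.90, 0: 0.10}, 0: {1: 0.05, 0: 0.95}}
P_fatigue_given_cancer = {1: {1: 0.70, 0: 0.30}, 0: {1: 0.20, 0: 0.80}}

def lw_estimate(n_samples=100_000, seed=0):
    rng = random.Random(seed)
    weighted_cancer, total_weight = 0.0, 0.0
    for _ in range(n_samples):
        smoker = 0                              # evidence: non-smoker (clamped)
        w = P_smoker[smoker]
        cancer = 1 if rng.random() < P_cancer_given_smoker[smoker][1] else 0
        w *= P_xray_given_cancer[cancer][1]     # evidence: X-ray positive
        w *= P_fatigue_given_cancer[cancer][0]  # evidence: no fatigue
        weighted_cancer += w * cancer
        total_weight += w
    return weighted_cancer / total_weight

print(f"P(cancer | evidence) ~= {lw_estimate():.4f}")  # close to the exact value
```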
Learning in Bayesian Networks
|Learning parameters (probabilities) for a given BN (the graph is given).
-Estimate the (conditional) probabilities from past data.
|Learning both the structure and the parameters
-A more challenging task beyond the scope of this discussion.
Learning the Probabilities
|Basic ideas
– Use relative frequency for estimating probability.
– A prior distribution is typically assumed.
- The prior is then updated by the data into the posterior.
- Using the MLE principle
|The so-called “Expectation-Maximization (EM) Algorithm” is often used.
-Iteratively update our guess for the parameters; each step attempts to apply the MLE principle.
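A tiny illustration of the two estimation ideas on a single conditional probability, assuming hypothetical counts and a Beta prior (a common conjugate choice, not something specified in the slides): the MLE is the plain relative frequency, while the Bayesian estimate updates the prior into a posterior and reports its mean.

```python
# Estimating one conditional probability, e.g. P(Fatigue=1 | Cancer=1),
# from hypothetical counts.
n_cancer = 40          # cases with Cancer=1 in the data
n_fatigue = 28         # of those, cases that also have Fatigue=1

# MLE: plain relative frequency.
p_mle = n_fatigue / n_cancer

# Bayesian: Beta(a, b) prior updated by the data into a Beta posterior;
# report its mean (a "smoothed" relative frequency).
a, b = 2.0, 2.0        # assumed prior pseudo-counts
p_posterior_mean = (n_fatigue + a) / (n_cancer + a + b)

print(f"MLE = {p_mle:.3f}, posterior mean = {p_posterior_mean:.3f}")
```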
Graphical Models
Hidden Markov Formulation
Objectives
Introduce Hidden Markov Models
Illustrate HMM with intuitive examples
Hidden Markov Models
|Hidden Markov Models (HMMs) are a type of dynamic Bayesian Network
-Modeling a process indexed by time
|“Hidden”: the observations are due to some underlying (hidden) states not directly observable.
|“Markov”: the state transitions are governed by a Markov process.
Discrete Markov Process
|Consider a system which may be described at any time as being in one of a set of N distinct states, S1, …, SN.
| At time instances t=1,2,3,…, the system changes its state according to certain probabilities. The full description requires us to know P(st=Sj | st-1=Si, st-2=Sk, …, s1=Sm) for all t, i, k, …, m, where st stands for the state of the system at time t.
– For a first-order Markov chain, we need to consider only P(st=Sj | st-1=Si)
– Further assume the transition probabilities are "stationary" (the same for all t):
aij = P(st=Sj | st-1=Si), 1≤i,j≤N, for any t.
A Simple Example
|Assume one of the three states for each day: S1-rainy, S2-cloudy, S3-sunny
|Assume the transition probability matrix (rows: current state Si; columns: next state Sj)
A = {aij} =
  0.5 0.2 0.3
  0.3 0.3 0.4
  0.1 0.2 0.7
|Many questions we may ask, based on this model.
-E.g., given today is cloudy, what is the probability it remains cloudy for the next 5 days? (See the sketch below.)
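A minimal sketch of how such questions can be answered by multiplying transition probabilities along the assumed state sequence; the function name sequence_prob is hypothetical.

```python
import numpy as np

# Transition matrix from the slide: rows/cols ordered (rainy, cloudy, sunny).
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4],
              [0.1, 0.2, 0.7]])
states = {"rainy": 0, "cloudy": 1, "sunny": 2}

def sequence_prob(seq, start):
    """P(s_1, ..., s_k | s_0 = start) for a first-order Markov chain."""
    p, prev = 1.0, states[start]
    for name in seq:
        cur = states[name]
        p *= A[prev, cur]
        prev = cur
    return p

# Given today is cloudy, probability it stays cloudy for the next 5 days.
print(sequence_prob(["cloudy"] * 5, start="cloudy"))   # 0.3**5 = 0.00243
```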
Extending to “Hidden” States
|The previous example is an “observable” Markov model: the output of the system/process is the states of interest.
|Now assume that we can only measure the (average) temperature of a day
-Further assume this measurement is useful for predicting the weather states (rainy, cloudy, sunny).
-We can view the temperature values as being produced by the hidden states of interest, i.e., the weather.
A Simple HMM
A Specific Process from the Model
States:       rainy  cloudy  cloudy  sunny  cloudy
Observations: 70°    72°     75°     78°    77°
Specifying an HMM
|Θ: the set of hidden states.
|The state transition probabilities aij = P(st=Sj | st-1=Si), 1≤i,j≤N
– Let A={aij } be the transition probability matrix
|Ω: the set of outputs (observations).
Specifying an HMM (cont’d)
|The observation probabilities: P(ot|st ), where ot stands for the observation at time t, given the state st. This is also called the emission probability.
– For a discrete observation space, we can define the emission probability matrix B = {bjk}, where bjk = P(ot = vk | st = Sj) and vk is the kth symbol in Ω.
|The initial state distribution π = {πi}, πi=P(s1=Si)
– Sometimes we are given an initial state, i.e., P(s1=Si)=1 for certain i.
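Putting the pieces together, a sketch of the weather/temperature HMM as plain arrays, with the temperature discretised into three bands (cool/mild/warm) and the emission matrix B and initial distribution π made up for illustration; the sample helper generates a (state sequence, observation sequence) pair like the one shown earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden states Theta and a discretised observation alphabet Omega
# (the temperature bands are an assumption for illustration).
states = ["rainy", "cloudy", "sunny"]
obs_symbols = ["cool", "mild", "warm"]

A = np.array([[0.5, 0.2, 0.3],      # state transition probabilities a_ij
              [0.3, 0.3, 0.4],
              [0.1, 0.2, 0.7]])
B = np.array([[0.6, 0.3, 0.1],      # emission probabilities b_jk = P(o=v_k | s=S_j)
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
pi = np.array([0.3, 0.4, 0.3])      # initial state distribution

def sample(T):
    """Generate a (state sequence, observation sequence) pair of length T."""
    s = rng.choice(len(states), p=pi)
    path, obs = [], []
    for _ in range(T):
        path.append(states[s])
        obs.append(obs_symbols[rng.choice(len(obs_symbols), p=B[s])])
        s = rng.choice(len(states), p=A[s])
    return path, obs

print(sample(5))
```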
Basic Problems in HMM
|For a given HMM Λ={Θ,Ω,A,B,π}
-Problem 1: Given an observation (sequence) O={o1,o2, … ,ok}, what is the most likely state sequence S={s1,s2, … ,sk} that has produced O?
-Problem 2: How likely is an observation O (i.e., what is P(O)) ?
-Problem 3: How to estimate the model parameters (A,B,π)?
Graphical Models
Hidden Markov Models: Learning & Inference
Objectives
Implement HMM learning & inference algorithms
Basic Problems in HMM
|For a given HMM Λ={Θ,Ω,A,B,π}
-Problem 1: Given an observation (sequence) O={o1,o2, … ,ok}, what is the most likely state sequence S={s1,s2, … ,sk} that has produced O?
-Problem 2: How likely is an observation O (i.e., what is P(O)) ?
-Problem 3: How to estimate the model parameters (A,B,π)?
Problem 1: State Estimation
|Given an observation (sequence) O={o1,o2, … ,ok}, what is the most likely state sequence S={s1,s2, … ,sk} that has produced O?
|Formally, we need to solve $\arg\max_{S} P(S \mid O)$
|Or, equivalently,
$\arg\max_{S} \dfrac{P(S, O)}{P(O)} = \arg\max_{S} P(S, O)$
Problem 1: State Estimation (cont’d)
|For a given HMM, we may simplify P(S,O) as
$$
\begin{aligned}
P(S, O) &= P(O \mid S)\,P(S) \\
&= P(o_1 \ldots o_k \mid s_1 \ldots s_k) \prod_{j=1}^{k} P(s_j \mid s_1 \ldots s_{j-1}) \\
&\simeq P(o_1 \ldots o_k \mid s_1 \ldots s_k) \prod_{j=1}^{k} P(s_j \mid s_{j-1}) \\
&= \prod_{i=1}^{k} P(o_i \mid o_1 \ldots o_{i-1}, s_1 \ldots s_i) \prod_{j=1}^{k} P(s_j \mid s_{j-1}) \\
&\simeq \prod_{i=1}^{k} P(o_i \mid s_i) \prod_{j=1}^{k} P(s_j \mid s_{j-1}) = \prod_{i=1}^{k} P(o_i \mid s_i)\, P(s_i \mid s_{i-1})
\end{aligned}
$$
The ”Weather” Example
|Let’s expand the state space as a trellis, for the earlier example:
-t(.|.) is the transition probability and e(.|.) the emission probability.
-S1-rainy, S2-cloudy, S3-sunny
➔Goal: identify a path through the trellis for which the product of the t's and the e's is maximized.
[Figure: trellis with initial state S* on the left, a column of states S1, S2, S3 at each time step, and observations o1, o2, …, oi, …, ok along the bottom]
Viterbi Algorithm for Problem 1
|A dynamic programming solution. For each state in the trellis, we record:
1. $\delta_{S_i}(t)$: the probability of the most likely path that ends at state Si at time t while generating o1…ot.
2. $\psi_{S_i}(t)$: the predecessor state (at time t-1) on that most likely path, used to recover the full state sequence by backtracing.
Viterbi Algorithm (cont’d)
|Initialization:
$\delta_{S_i}(1) = t(S_i \mid s^{*})\, e(o_1 \mid S_i), \quad \forall S_i \in \Theta$
|Induction: for $2 \le t \le k$, do
$\delta_{S_i}(t) = \max_{S_j}\; t(S_i \mid S_j)\, e(o_t \mid S_i)\, \delta_{S_j}(t-1)$
$\psi_{S_i}(t) = \arg\max_{S_j}\; t(S_i \mid S_j)\, e(o_t \mid S_i)\, \delta_{S_j}(t-1)$
|Termination:
-The probability of the best state sequence: $\max_{S_i} \delta_{S_i}(k)$
-The best last state: $\hat{s}_k = \arg\max_{S_i} \delta_{S_i}(k)$
-Back-trace to get the other states: $\hat{s}_t = \psi_{\hat{s}_{t+1}}(t+1)$, for $t = k-1, \ldots, 1$.
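A compact sketch of this recursion (using an initial distribution π in place of the fixed initial state s*, and reusing the made-up weather HMM matrices from the earlier sketch); variable and function names are assumptions.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state sequence for a discrete-observation HMM.

    obs : list of observation indices o_1..o_k
    A[i, j] = P(s_t = S_j | s_{t-1} = S_i),  B[j, k] = P(o_t = v_k | s_t = S_j),
    pi[j] = P(s_1 = S_j).
    """
    k, n = len(obs), len(pi)
    delta = np.zeros((k, n))           # delta[t, i]: best path probability ending in S_i
    psi = np.zeros((k, n), dtype=int)  # psi[t, i]: best predecessor of S_i at time t

    delta[0] = pi * B[:, obs[0]]                         # initialisation
    for t in range(1, k):                                # induction
        scores = delta[t - 1][:, None] * A * B[:, obs[t]][None, :]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)

    path = [int(delta[-1].argmax())]                     # termination + backtrace
    for t in range(k - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

# Example with the made-up weather HMM (indices: 0 rainy, 1 cloudy, 2 sunny).
A = np.array([[0.5, 0.2, 0.3], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]])
pi = np.array([0.3, 0.4, 0.3])
print(viterbi([0, 1, 2, 2], A, B, pi))   # observations: cool, mild, warm, warm
```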
Problem 2: Evaluate P(O)
|To evaluate P(O), we can do $P(O) = \sum_{S} P(S, O)$
|From the trellis, a solution can be found by summing the probabilities of all paths generating the given observation sequence.
|A dynamic programming solution: the forward algorithm or the backward algorithm.
The Forward Algorithm
|Define the forward probability $\alpha_{S_i}(t) = P(o_1 \ldots o_t,\, s_t = S_i)$: the total probability, summed over all state paths, of ending at state Si at time t while generating o1…ot.
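A sketch of the corresponding forward recursion: the same structure as Viterbi, but summing over predecessor states instead of maximizing, so the final sum over α gives P(O). The matrices reuse the made-up weather HMM from the earlier sketches.

```python
import numpy as np

def forward_prob(obs, A, B, pi):
    """P(O) for a discrete-observation HMM via the forward algorithm.

    alpha[i] at step t equals P(o_1..o_t, s_t = S_i): the same recursion as
    Viterbi but with a sum over predecessors instead of a max.
    """
    alpha = pi * B[:, obs[0]]                   # alpha at t = 1
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]           # induction step
    return float(alpha.sum())                   # P(O) = sum_i alpha_i(k)

A = np.array([[0.5, 0.2, 0.3], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]])
pi = np.array([0.3, 0.4, 0.3])
print(forward_prob([0, 1, 2, 2], A, B, pi))
```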
Problem 3: Parameter Learning
|Case 1: we have a set of labeled data - sequences in which the hidden states are also given.
-Use relative frequency for estimating the probabilities → the MLE solution (see the counting sketch at the end of this section):
$e(o_r \mid S_j) = \dfrac{\text{number of } (o_t = o_r,\ s_t = S_j)}{\text{number of } s_t = S_j}$
$t(S_i \mid S_j) = \dfrac{\text{number of } (s_t = S_i,\ s_{t-1} = S_j)}{\text{number of } s_{t-1} = S_j}$
|Case 2: we have only the observation sequence
-The Forward-Backward Algorithm (a.k.a. Baum-Welch): an EM approach.
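Returning to Case 1, a minimal counting sketch of the relative-frequency (MLE) estimates from labelled sequences; the training data here are hypothetical.

```python
from collections import Counter

# Hypothetical labelled training data: each day is a (state, observation) pair.
sequences = [
    [("rainy", "cool"), ("rainy", "cool"), ("cloudy", "mild"), ("sunny", "warm")],
    [("cloudy", "mild"), ("sunny", "warm"), ("sunny", "warm"), ("cloudy", "mild")],
]

trans, emit = Counter(), Counter()
state_count, prev_count = Counter(), Counter()
for seq in sequences:
    for t, (state, obs) in enumerate(seq):
        emit[(state, obs)] += 1          # counts for e(obs | state)
        state_count[state] += 1
        if t > 0:
            prev = seq[t - 1][0]
            trans[(prev, state)] += 1    # counts for t(to | from), keyed (from, to)
            prev_count[prev] += 1

# Relative-frequency (MLE) estimates.
t_hat = {k: v / prev_count[k[0]] for k, v in trans.items()}
e_hat = {k: v / state_count[k[0]] for k, v in emit.items()}
print(t_hat)
print(e_hat)
```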