COMP9418: Advanced Topics in Statistical Machine Learning
Approximate Inference by Sampling
University of New South Wales
§ Stochastic sampling is a method for estimating probabilities
§ It works by measuring the frequency of events according to a simulation
§ We can efficiently simulate situations according to their probability of occurrence by operating on the corresponding network
§ Basic idea
§ Draw 𝑛 samples from a sampling distribution 𝑃′
§ Compute an approximate probability
§ Show this converges to the true probability 𝑃
§ Why sample?
§ Inference: getting a sample is faster than computing the right answer
Monte Carlo Simulation
§ We first simulate a random sample 𝒙₁, …, 𝒙ₙ from the underlying distribution 𝑃(𝑿)
§ Evaluate a function 𝑓 at each instantiation of the sample, 𝒙₁, …, 𝒙ₙ
§ Compute the arithmetic average, known as the sample mean
$\mathrm{Av}_n(f) \triangleq \frac{1}{n} \sum_{i=1}^{n} f(\boldsymbol{x}_i)$
§ We can also compute the sample variance
$S_n^2(f) \triangleq \frac{1}{n-1} \sum_{i=1}^{n} \left( f(\boldsymbol{x}_i) - \mathrm{Av}_n(f) \right)^2$
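§ A minimal Python sketch of this procedure; the coin-flip example and the function names are illustrative, not from the slides:

import random

def monte_carlo(f, sample, n):
    """Return the sample mean Av_n(f) and sample variance S_n^2(f)."""
    xs = [sample() for _ in range(n)]                 # x_1, ..., x_n ~ P(X)
    av = sum(f(x) for x in xs) / n                    # sample mean
    s2 = sum((f(x) - av) ** 2 for x in xs) / (n - 1)  # sample variance
    return av, s2

# Example: estimate P(heads) for a fair coin via the indicator function
av, s2 = monte_carlo(lambda x: 1.0 if x == "heads" else 0.0,
                     lambda: random.choice(["heads", "tails"]), 10000)
print(av, s2)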
§ Sampling from given distribution
§ Step 1: Get sample u from uniform distribution over [0, 1)
§ E.g. random() in python
§ Step 2: Convert this sample u into an outcome for the given distribution by having each outcome associated with a sub-interval of [0,1) with sub-interval size equal to probability of the outcome
§ If random() returns u = 0.83, then our sample is C = blue
§ E.g., after sampling 8 times we obtain an empirical count for each outcome
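§ A sketch of the two-step procedure in Python. The outcome probabilities below (red 0.6, green 0.1, blue 0.3) are an assumption standing in for the slide's missing figure, chosen so that 𝑢 = 0.83 falls in blue's sub-interval:

import random

# Hypothetical distribution for C; each outcome owns a sub-interval of [0, 1)
dist = [("red", 0.6), ("green", 0.1), ("blue", 0.3)]

def sample_discrete(dist):
    u, cumulative = random.random(), 0.0   # Step 1: u ~ Uniform[0, 1)
    for outcome, p in dist:                # Step 2: locate u's sub-interval
        cumulative += p
        if u < cumulative:
            return outcome
    return dist[-1][0]                     # guard against rounding error

print([sample_discrete(dist) for _ in range(8)])   # e.g. 8 samples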
§ Forward Sampling
§ Rejection Sampling
§ Likelihood Weighting
§ Gibbs Sampling
Forward Sampling
[Figure: the Cloudy, Sprinkler, Rain, Wet Grass network]
Simulate Bayesian Network: Simulate_BN
Input: Network 𝑁 with variables 𝑿 inducing a distribution 𝑃
Output: one instantiation Σ

Σ ← ⟨⟩                                             # trivial instantiation
𝜋 ← topological order of the nodes in 𝑁
for 𝑖 = 1 to 𝑛 do                                  # network has 𝑛 variables
    𝑋 ← variable at position 𝑖 in order 𝜋
    𝒖 ← value of 𝑋's parents in instantiation Σ
    𝑥 ← value of 𝑋 sampled according to 𝑃(𝑋|𝒖)
    Σ ← Σ, 𝑥                                       # append value 𝑥 to Σ
return Σ
Simulate Bayesian Network
Input: event 𝛼, sample size 𝑛, Network 𝑁
Output: sample mean for 𝑓_𝛼

𝑝 ← 0
for 𝑖 = 1 to 𝑛 do
    𝒙_𝑖 ← Simulate_BN(𝑁)                           # simulate the Bayesian network
    𝑝 ← 𝑝 + 𝑓_𝛼(𝒙_𝑖)
return 𝑝/𝑛
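§ A runnable sketch of Simulate_BN and the estimator above, in Python. The network and CPT values below (the classic Cloudy/Sprinkler/Rain/WetGrass example with the usual textbook numbers) are an assumption for illustration:

import random

# The Cloudy/Sprinkler/Rain/WetGrass network; CPTs map parent values to
# P(variable = True). The numbers are illustrative textbook values.
parents = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
cpt = {
    "C": {(): 0.5},
    "S": {(True,): 0.1, (False,): 0.5},
    "R": {(True,): 0.8, (False,): 0.2},
    "W": {(True, True): 0.99, (True, False): 0.9,
          (False, True): 0.9, (False, False): 0.0},
}
order = ["C", "S", "R", "W"]                      # a topological order pi

def simulate_bn():
    """Forward-sample one instantiation Sigma, following the order pi."""
    sigma = {}
    for x in order:
        u = tuple(sigma[p] for p in parents[x])   # parents' values in Sigma
        sigma[x] = random.random() < cpt[x][u]    # sample x ~ P(X | u)
    return sigma

def estimate(event, n):
    """Sample mean of the indicator f_alpha over n instantiations."""
    return sum(event(simulate_bn()) for _ in range(n)) / n

print(estimate(lambda s: s["W"], 10000))          # ~P(w)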
§ Get several samples from the Bayesian network:
    𝑐, s̄, 𝑟, 𝑤
    𝑐, 𝑠, 𝑟, 𝑤
    c̄, 𝑠, 𝑟, w̄
    𝑐, s̄, 𝑟, 𝑤
    c̄, s̄, r̄, 𝑤
§ If we want to know 𝑃(𝑊)
§ We have counts ⟨𝑤: 4, w̄: 1⟩
§ Normalize to get 𝑃(𝑊) = ⟨𝑤: 0.8, w̄: 0.2⟩
§ This will get closer to the true distribution with more samples
§ Can estimate anything else, too
§ What about 𝑃(𝐶|w̄)? 𝑃(𝐶|𝑟, 𝑤)? 𝑃(𝐶|r̄, w̄)?
§ Fast: can use fewer samples if less time (what’s the drawback?)
How Many Samples?
§ Bounds for independent sampling, for an estimator $\mathrm{Av}_n(f_\alpha)$ of $P(\alpha)$
§ Chebyshev bound: $P\left( \left| \mathrm{Av}_n(f_\alpha) - P(\alpha) \right| \le \epsilon \right) \ge 1 - \frac{P(\alpha)\,P(\bar{\alpha})}{n \epsilon^2}$
How Many Samples?
§ Bounds for independent sampling, for an estimator $\mathrm{Av}_n(f_\alpha)$ of $P(\alpha)$
§ Hoeffding bound: $P\left( \left| \mathrm{Av}_n(f_\alpha) - P(\alpha) \right| \le \epsilon \right) \ge 1 - 2 e^{-2 n \epsilon^2}$
How Many Samples?
§ We can define the relative error as
$\frac{\left| \mathrm{Av}_n(f_\alpha) - P(\alpha) \right|}{P(\alpha)}$
§ Chebyshev bound: $P\left( \frac{\left| \mathrm{Av}_n(f_\alpha) - P(\alpha) \right|}{P(\alpha)} \le \epsilon \right) \ge 1 - \frac{P(\bar{\alpha})}{n \epsilon^2 P(\alpha)}$
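§ Solving the Hoeffding bound for 𝑛 gives $n \ge \ln(2/\delta) / (2\epsilon^2)$ for absolute error at most 𝜖 with probability at least 1 − 𝛿; a small illustrative Python helper:

from math import ceil, log

def hoeffding_n(epsilon, delta):
    """Smallest n with 2*exp(-2*n*epsilon**2) <= delta."""
    return ceil(log(2 / delta) / (2 * epsilon ** 2))

print(hoeffding_n(0.01, 0.05))   # 18445 samples for eps=0.01, delta=0.05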
Estimating a Conditional Probability
§ Consider now the problem of estimating a conditional probability P(𝛼|𝛽)
§ With a distribution 𝑃(. ) induced by a Bayesian network
§ However, sampling from the distribution P(. |𝛽) is generally hard
§ We typically cannot efficiently generate a sequence of independent instantiations 𝒙₁, …, 𝒙ₙ
§ Where the probability of generating instantiation 𝒙ᵢ is 𝑃(𝒙ᵢ|𝛽)
Rejection Sampling
§ Suppose we want 𝑃(𝐶|𝑟)
§ We can compute this using Bayes conditioning:
$P(\alpha \mid \beta) = \frac{P(\alpha, \beta)}{P(\beta)}$
§ Count 𝐶 outcomes, but ignore (reject) samples which do not have 𝑅 = 𝑟
§ This is called rejection sampling
§ It is, essentially, the same as forward sampling
§ E.g., from the samples
    𝑐, s̄, 𝑟, 𝑤
    𝑐, 𝑠, 𝑟, 𝑤
    c̄, 𝑠, 𝑟, w̄
    𝑐, s̄, 𝑟, 𝑤
    c̄, s̄, r̄, 𝑤
the last sample is rejected since it has 𝑅 = r̄
Rejection Sampling: Simulate_BN
Input: Network 𝑁 with variables 𝑿 inducing a distribution 𝑃, evidence 𝒆
Output: one instantiation Σ consistent with 𝒆, or ⊺

Σ ← ⟨⟩                                             # trivial instantiation
𝜋 ← topological order of the nodes in 𝑁
for 𝑖 = 1 to 𝑛 do                                  # network has 𝑛 variables
    𝑋 ← variable at position 𝑖 in order 𝜋
    𝒖 ← value of 𝑋's parents in instantiation Σ
    𝑥 ← value of 𝑋 sampled according to 𝑃(𝑋|𝒖)
    if 𝑥 is not consistent with 𝒆
        return ⊺                                   # reject, no sample is generated
    Σ ← Σ, 𝑥                                       # append value 𝑥 to Σ
return Σ
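§ A sketch of rejection sampling in Python, reusing simulate_bn() from the forward-sampling sketch above. For simplicity it checks consistency after sampling the full instantiation, whereas the pseudocode rejects as soon as an inconsistent value is drawn:

def rejection_sample(evidence, n):
    """Keep only samples consistent with the evidence, e.g. {"R": True}."""
    kept = []
    for _ in range(n):
        s = simulate_bn()
        if all(s[v] == val for v, val in evidence.items()):
            kept.append(s)            # consistent with e: keep
        # otherwise the sample is rejected
    return kept

kept = rejection_sample({"R": True}, 10000)
print(len(kept), sum(s["C"] for s in kept) / len(kept))   # ~P(c | r)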
How many samples?
§ Rejection sampling is a form of forward sampling
§ Hoeffding and Chebyshev bounds still hold
§ But now 𝑛 is the number of samples kept
§ If the goal is to estimate 𝑃(𝑋|𝒆)
§ The expected fraction of samples kept is approximately 𝑃(𝒆)
§ For large Bayesian networks and lots of evidence 𝑃(𝒆) will be tiny
§ The number of examples kept decreases exponentially with number of observed variables
Likelihood Weighting
§ Problem with rejection sampling:
§ If evidence is unlikely, it rejects lots of samples
§ Evidence is not exploited as you sample
§ Consider 𝑃(𝑆ℎ𝑎𝑝𝑒 | 𝑏𝑙𝑢𝑒)
§ Idea: fix evidence variables and sample the rest
§ Problem: the sample distribution is not 𝑃 anymore!
§ Solution: weight each sample by the probability of the evidence given its parents
§ E.g., samples without fixing the evidence:
    pyramid, green
    pyramid, red
    sphere, blue
    cube, red
    sphere, green
§ Samples with the colour fixed to 𝑏𝑙𝑢𝑒:
    pyramid, blue
    pyramid, blue
    sphere, blue
    cube, blue
    sphere, blue
Likelihood Weighting
[Figure: the Cloudy, Sprinkler, Rain, Wet Grass network]
Likelihood Weighting
Input: Network 𝑁 with variables 𝑿 inducing a distribution 𝑃, evidence 𝒆 on variables 𝑬
Output: one instantiation Σ and associated weight 𝑤

Σ ← ⟨⟩                                             # trivial instantiation
𝑤 ← 1                                              # initial weight for Σ
𝜋 ← topological order of the nodes in 𝑁
for 𝑖 = 1 to 𝑛 do                                  # network has 𝑛 variables
    𝑋 ← variable at position 𝑖 in order 𝜋
    𝒖 ← value of 𝑋's parents in instantiation Σ
    if 𝑋 ∈ 𝑬
        𝑥 ← value of 𝑋 in 𝒆
        𝑤 ← 𝑤 × 𝑃(𝑥|𝒖)
    else
        𝑥 ← value of 𝑋 sampled according to 𝑃(𝑋|𝒖)
    Σ ← Σ, 𝑥                                       # append value 𝑥 to Σ
return Σ, 𝑤
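§ A sketch of likelihood weighting in Python, reusing parents, cpt and order from the forward-sampling sketch above; the query at the end is illustrative:

def lw_sample(evidence):
    """One weighted sample: evidence variables are fixed and contribute
    P(x | u) to the weight; the others are forward-sampled."""
    sigma, w = {}, 1.0
    for x in order:
        u = tuple(sigma[p] for p in parents[x])
        p_true = cpt[x][u]
        if x in evidence:
            sigma[x] = evidence[x]                  # fix the evidence value
            w *= p_true if evidence[x] else 1 - p_true
        else:
            sigma[x] = random.random() < p_true     # sample x ~ P(X | u)
    return sigma, w

# Estimate P(c | s, w) as a weighted fraction of samples with C = true
pairs = [lw_sample({"S": True, "W": True}) for _ in range(10000)]
print(sum(w for s, w in pairs if s["C"]) / sum(w for _, w in pairs))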
Likelihood Weighting
§ Likelihood weighting is good
§ We have taken evidence into account as we generate the sample
§ E.g. here, 𝑊's value will get picked based on the evidence values of 𝑆, 𝑅
§ More of our samples will reflect the state of the world suggested by the evidence
§ Likelihood weighting doesn’t solve all our problems
§ Evidence influences the choice of downstream variables, but not upstream ones (𝐶 is not more likely to get a value matching the evidence)
§ We would like to consider evidence when we sample every variable
§ Gibbs sampling
§ This sampling method will allow us to sample Markov Networks
§ Procedure: keep track of a full instantiation 𝑥1, 𝑥2, … , 𝑥𝑛. Start with an instantiation consistent with the evidence. Sample one variable at a time, conditioned on all the rest, but keep evidence fixed. Keep repeating this for a long time
§ Property: in the limit of repeating this infinitely many times the resulting sample is coming from the correct distribution
§ Rationale: both upstream and downstream variables condition on evidence
§ In contrast: likelihood weighting only conditions on upstream evidence, and hence weights obtained in likelihood weighting can sometimes be very small. Sum of weights over all samples is indicative of how many “effective” samples were obtained, so we want high weights
Example: 𝑃(𝑆|𝑟)
§ Step 1: Fix evidence
    § 𝑅 = 𝑟
§ Step 2: Initialize other variables
§ Step 3: Repeat
    § Choose a non-evidence variable 𝑋
    § Resample 𝑋 from 𝑃(𝑋 | all other variables)
§ E.g., sample from 𝑃(𝑆|𝑐, w̄, 𝑟), then from 𝑃(𝐶|𝑠, w̄, 𝑟), then from 𝑃(𝑊|𝑠, 𝑐, 𝑟)
Efficient Resampling of One Variable
§ Sample from 𝑃(𝑆|𝑐, 𝑟, w̄)
§ Many things cancel out – only CPTs with 𝑆 remain: $P(S \mid c, r, \bar{w}) \propto P(S \mid c)\, P(\bar{w} \mid S, r)$
§ More generally: only the CPTs that mention the resampled variable need to be considered, and joined together
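§ A sketch of this single-variable resampling step in Python, reusing the network from the forward-sampling sketch above; only the CPT of the resampled variable and the CPTs of its children are touched:

def resample(var, state):
    """Sample var from P(var | all other variables) by joining only the
    CPTs that mention var (its own CPT and those of its children)."""
    weights = {}
    for value in (True, False):
        s = dict(state, **{var: value})
        w = 1.0
        for x in order:
            if x == var or var in parents[x]:       # CPTs mentioning var
                u = tuple(s[p] for p in parents[x])
                w *= cpt[x][u] if s[x] else 1 - cpt[x][u]
        weights[value] = w
    return random.random() < weights[True] / (weights[True] + weights[False])

state = {"C": True, "S": True, "R": True, "W": False}
state["S"] = resample("S", state)     # one draw from P(S | c, r, w-bar)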
Markov Network Example
§ Sample from 𝑃(𝐴|𝑏, 𝑐, 𝑑)
Gibbs Sampling and Markov Chain
§ Gibbs sampling produces samples from the query distribution 𝑃(𝑄|𝒆) in the limit of re-sampling infinitely often
§ Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods
§ Gibbs sampling is a Markov chain
§ We start with a sample from an arbitrary distribution 𝑃₀ and approximate a stationary distribution 𝜋
Markov Chain and Sampling
§ Our goal is to compute 𝑃(𝒙)
§ However, 𝑃 may be too hard to sample directly
§ For instance, we need to sample from 𝑃(𝑿|𝒆) but 𝑃(𝒆) is very small
§ We construct a Markov chain 𝑇
§ Whose unique stationary distribution is 𝑃
§ Therefore, 𝑇 should be irreducible (regular)
§ We start by sampling 𝑥₀ from 𝑃₀
§ But we cannot use these initial samples
§ Since they are not from the stationary distribution 𝜋
Markov Chain and Sampling
§ Continue sampling from the transition probability 𝑃(𝑋ₜ | 𝑋ₜ₋₁)
§ We want to use samples that come from a distribution close to 𝑃
§ But, at early iterations, 𝑃ₜ is usually far from 𝑃
§ Therefore, we need to wait for the chain to converge to the stationary distribution
§ We say that we want to start collecting samples after the chain has run long enough to “mix”
§ That is, 𝑃ₜ is close to the stationary distribution 𝜋
“Mixing” Chains
§ The term “mix” is used to denote that the chain is close to 𝜋
§ We want to start sampling only after the chain has mixed
§ However, how do we know if a chain has mixed?
§ Unfortunately, we usually do not know
§ The practice is to run several tests
§ If no evidence of a non-mixing chain is found, we assume the chain has mixed
§ How do we know a chain has not mixed?
§ Compare chain statistics in different windows within a single run of the chain
§ Compare statistics across different runs initialized differently (recommended)
[Figure: chain samples 𝑥₁, 𝑥₂, …, 𝑥ₖ₊₁, 𝑥ₖ₊₂, … grouped into windows 𝑊₁ and 𝑊₂]
Measuring Convergence
§ One popular method is to compare the within-chain (𝑊) and between-chain (𝐵) variances, given 𝑚 chains with 𝑛 samples each
$\bar{\theta}_j = \frac{1}{n} \sum_{i=1}^{n} f(x_{i,j})$
$\bar{\theta} = \frac{1}{m} \sum_{j=1}^{m} \bar{\theta}_j$
$\sigma_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( f(x_{i,j}) - \bar{\theta}_j \right)^2$
$W = \frac{1}{m} \sum_{j=1}^{m} \sigma_j^2$
$B = \frac{n}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\theta} \right)^2$
[Figure: trace plot of a non-mixing chain]
§ Then we calculate
$\hat{R} = \sqrt{\frac{W + \frac{1}{n}(B - W)}{W}}$
§ If 𝑊 → 𝐵 or 𝑛 → ∞, the numerator → 𝑊 and so $\hat{R}$ → 1
§ Initially $\hat{R} \gg 1$
§ Estimate $\hat{R}$ across several parameters and stop when all parameters satisfy $\hat{R} < 1.1$
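§ A sketch of the $\hat{R}$ computation in Python, for 𝑚 chains each providing 𝑛 values of some scalar statistic 𝑓(𝑥); the two toy chains are illustrative:

def r_hat(chains):
    """Gelman-Rubin statistic over m chains of n scalar values each."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]                      # theta_bar_j
    grand = sum(means) / m                                    # theta_bar
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within, W
    b = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)  # between, B
    return ((w + (b - w) / n) / w) ** 0.5

# Values near 1 are consistent with (but do not prove) mixing
print(r_hat([[0.20, 0.30, 0.25, 0.22], [0.21, 0.28, 0.27, 0.24]]))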
Chain Samples
§ Once the chain mixes, all samples are from the stationary distribution
§ We should use all samples 𝑥ₜ for 𝑡 > 𝑇_mix
§ Nearby samples are correlated
§ The samples are not independent
§ The effective sample size is smaller than independent sampling
§ The faster the chain mixes, the less correlated are the samples
§ Slow convergence indicates we move slowly in the space
MCMC Algorithm Burn-in
Input: Network 𝑁 with variables 𝑿 inducing distribution 𝑃, evidence 𝒆 on variables 𝑬, number of chains 𝑐
Output: instantiations Σ_{i,j}

𝑖 ← 0
for 𝑗 = 1 to 𝑐 do
    Σ_{0,j} ← a complete instantiation of variables 𝑿 − 𝑬    # from 𝑃(𝑿|𝒆) if possible
while not mixed
    for 𝑗 = 1 to 𝑐 do
        𝑋 ← a variable chosen randomly from 𝑿 − 𝑬
        𝑥 ← value of 𝑋 sampled according to 𝑃(𝑋 | Σ_{i,j} − 𝑋, 𝒆)
        Σ_{i+1,j} ← Σ_{i,j} ⊕ 𝑥                              # change value of 𝑋 in Σ_{i,j}
    𝑖 ← 𝑖 + 1
    compute convergence criterion over windows of Σ, such as $\hat{R}$
return Σ_{i,j} for each chain 𝑗
MCMC Algorithm Sampling
Input: Network 𝑁 with variables 𝑿 inducing distribution 𝑃, evidence 𝒆 on variables 𝑬, number of chains 𝑐, instantiations Σ_j
Output: set of instantiations 𝐷

𝐷 ← ∅
while not sufficient samples
    for 𝑗 = 1 to 𝑐 do
        𝑋 ← a variable chosen randomly from 𝑿 − 𝑬
        𝑥 ← value of 𝑋 sampled according to 𝑃(𝑋 | Σ_j − 𝑋, 𝒆)
        Σ_j ← Σ_j ⊕ 𝑥                                        # change value of 𝑋 in Σ_j
        𝐷 ← 𝐷 ∪ {Σ_j}
return 𝐷
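§ A sketch tying the two algorithms together for a single chain: burn-in followed by sample collection, reusing order and resample() from the earlier sketches (the initialisation and query are illustrative):

def gibbs(evidence, burn_in, n):
    """Collect n Gibbs samples from ~P(X | e) after burn_in steps."""
    state = dict(evidence)
    hidden = [x for x in order if x not in evidence]
    for x in hidden:
        state[x] = random.random() < 0.5     # arbitrary initialisation
    samples = []
    for t in range(burn_in + n):
        x = random.choice(hidden)            # a non-evidence variable
        state[x] = resample(x, state)        # x ~ P(X | all others, e)
        if t >= burn_in:
            samples.append(dict(state))      # record a copy
    return samples

samples = gibbs({"R": True}, burn_in=1000, n=10000)
print(sum(s["S"] for s in samples) / len(samples))   # ~P(s | r)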
§ Target distribution 𝑃(𝑿)
§ Gibbs distribution
§ It can represent equally well normalized or unnormalized factors
§ Markov chain state space
§ Complete assignments 𝒙 from 𝑿
§ The state space is exponentially large in the number of variables
§ Transition model given starting state 𝒙:
§ For each variable 𝑋ᵢ, sample 𝑥ᵢ ~ 𝑃(𝑋ᵢ | 𝒙 − 𝑋ᵢ, 𝒆)
Gibbs Sampling and Regularity
§ Is the Gibbs chain irreducible (regular)?
§ Not always: e.g., in 𝑃(𝑋₁, 𝑋₂, 𝑌) with the deterministic factor 𝑌 = 𝑋₁ XOR 𝑋₂, no single-variable resampling step can change the state, so the chain is not irreducible
§ However, if all factors are positive, the Gibbs chain is regular
Conclusion
§ Sampling methods are powerful techniques for approximate inference
§ Forward, rejection and likelihood-weighted sampling require traversing the network in topological order
§ They are applicable to Bayesian networks
§ MCMC works with complete states and therefore does not require a node ordering
§ It is applicable to Bayesian and Markov networks
§ However, these techniques have limitations
§ MCMC has parameters/design choices, may converge slowly, and it is difficult to tell when the chains have mixed
§ Gibbs sampling is computationally efficient per step but typically slow to mix
§ Read chapter 15