COMP 424 – Artificial Intelligence
Learning Bayesian Networks
Instructor: Jackie CK Cheung. Readings: R&N Ch. 20–20.2.
Review: Inference in Bayes nets
Constructing Belief Nets: CPDs
P(A | T, F)
Where do these numbers come from?
Parameter Estimation
Option 1: Ask an expert
• Use experts to select Bayes net structure and parameters
• Experts are often scarce and expensive.
• Experts can be inconsistent.
• Experts can be non-existent!
Option 2: Estimate parameters from data
• Estimate how the world works by observing samples!
Outline
• Learning with complete data
  • Supervised learning
  • Maximum likelihood estimation
  • Laplace smoothing
• Learning with incomplete data
  • Expectation maximization
Learning in Bayesian networks
• Given data in the form of instances:
Tampering  Fire  Smoke  Alarm  Leaving  Report
No         No    No     No     No       No
No         Yes   Yes    Yes    Yes      No
…          …     …      …      …        …
• Create a complete Bayes net!
1. Parameter estimation: given a graph structure, compute the conditional probability distributions (CPDs). We do this today!
2. Structure learning: figure out the graph structure as well as the numbers in the CPDs. Much harder to do!
Parameter estimation with complete data
Given:
• a Bayes network structure G;
• a choice of representation for the CPDs: P( Xi | Parents(Xi) ).
Goal: learn the CPD at each node, such that the network is “closest” to the probability distribution that generated the data.
• For simplicity, assume all random variables in the graph are binary.
Solving a Detective Mystery
• Your friend just flipped a coin 10 times:
H, T, H, H, H, T, H, H, H, T (7 heads out of 10 tosses)
• They then mixed the coin up with a bunch of other coins. You know the biases of the coins. Which one did they flip?
Coin toss example
• Your friend just flipped a coin 10 times:
H, T, H, H, H, T, H, H, H, T (7 heads out of 10 tosses)
• Which coin did they flip? The one with P(H) =
  • 0.2
  • 0.5
  • 0.7
  • 0.9
• Which of these values is possible? Probable? Likely? (See the sketch below.)
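A minimal sketch of this comparison (the helper function is illustrative, not from the slides): each candidate bias assigns a likelihood to the observed sequence, and the most likely coin is the one whose bias is closest to the empirical frequency of heads.

    # Compare candidate coin biases by the likelihood each assigns
    # to the observed data: 7 heads, 3 tails.
    def likelihood(theta, n_heads, n_tails):
        return theta ** n_heads * (1 - theta) ** n_tails

    for theta in [0.2, 0.5, 0.7, 0.9]:
        print(f"P(H) = {theta}: L = {likelihood(theta, 7, 3):.2e}")

    # All four biases are possible (each assigns the data nonzero
    # probability), but theta = 0.7 yields the highest likelihood.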
A network with one node
• We can model the coin as a one-node Bayesian network over X = {head, tail}.
• Let P(X) = θ be the (unknown) probability of landing on head.
• In this case, X is a Bernoulli random variable.
• Given a sequence of tosses x1, …, xm, how can we estimate θ?
Statistical parameter fitting
• Given instances x1, …, xm that are independently and identically distributed (i.i.d.):
  • the set of possible values for each variable in each instance is known;
  • each instance is obtained independently of the other instances;
  • each instance is sampled from the same distribution.
• The learning problem:
Find a set of parameters θ such that the data can be summarized by a probability P(xj | θ)
• θ depends on the family of probability distributions we consider (e.g. Bernoulli: θ={p}, Gaussian: θ={μ, σ2}, etc.).
How good is a parameter set?
• It depends on how likely it is to generate the observed data.
• Let D be the data set (all the instances).
• The likelihood of parameter set θ given data set D is defined as:
L(θ | D) = P(D | θ)
• If the instances are i.i.d. we have:
L(θ | D) = P(D | θ) = P( x1, x2, …, xm | θ) = ∏j=1:m P(xj | θ)
Example: Coin tossing
• Suppose you see the following data: D = H, T, H, T, T
• Recall θ = Pr(coin lands on heads).
• What is the likelihood of this sequence for parameter θ?
  L(θ | D) = θ (1 – θ) θ (1 – θ) (1 – θ)
• Likelihood has a familiar form: L(θ | D) = θN(H) (1 – θ)N(T)
where N(H) and N(T) are numbers of heads and tails observed.
Sufficient statistics
• To compute the likelihood in the coin tossing example, we only need to know N(H) and N(T) (number of heads and tails), not the full dataset.
• Here N(H) and N(T) are sufficient statistics for this probabilistic model (Bernoulli distribution).
• A sufficient statistic of the data is a function of the data that summarizes enough information to compute the likelihood.
• Formally, s(D) is a sufficient statistic if, for any two datasets D and D’:
s(D) = s(D’) ⇒ L(θ | D) = L(θ | D’)
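A quick illustration of the definition (toy sequences, using the Bernoulli model above): two different datasets with the same counts induce exactly the same likelihood function.

    # d1 and d2 differ as sequences but share sufficient statistics
    # N(H) = 2, N(T) = 1, so their likelihoods agree for every theta.
    def likelihood(theta, data):
        return theta ** data.count("H") * (1 - theta) ** data.count("T")

    d1 = ["H", "T", "H"]
    d2 = ["T", "H", "H"]
    assert all(likelihood(t / 10, d1) == likelihood(t / 10, d2)
               for t in range(11))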
Maximum likelihood estimation (MLE)
• Choose parameters that maximize the likelihood function.
• We want to maximize:
L(θ | D) = ∏j=1:m P(xj | θ)
• This is a product and products are hard to maximize!
• Instead, we can maximize:
log L(θ | D) = ∑j=1:m log P(xj | θ)
• To maximize, take the derivatives of this function with respect to θ and set them to 0 (calculus!).
MLE applied to the Bernoulli model
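The worked steps were on the slide image and are missing from the extracted text; the standard derivation is:

    log L(θ | D) = N(H) log θ + N(T) log (1 – θ)
    d/dθ log L(θ | D) = N(H)/θ – N(T)/(1 – θ) = 0
    ⇒ θMLE = N(H) / ( N(H) + N(T) )

i.e., the empirical fraction of heads. For D = H, T, H, T, T this gives θMLE = 2/5.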
MLE applied to a categorical distribution
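The details are likewise missing from the extracted text; the standard result follows the same pattern. For a categorical variable with k values, counts N(1), …, N(k), and m samples, maximize ∑v N(v) log θv subject to ∑v θv = 1 (e.g., with a Lagrange multiplier):

    θv,MLE = N(v) / m

i.e., each parameter is the empirical frequency of its value.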
Parameter estimation in a Bayes net
• Instances are of the form:
xj = ⟨ tj, fj, sj, aj, lj, rj ⟩, one observed value for each variable
• What parameters are we trying to estimate?
θ = { P(T), P(F), P(A|T,F), P(A|T,¬F), P(A|¬T,F), P(A|¬T,¬F), P(S|F), P(S|¬F), P(L|A), P(L|¬A), P(R|L), P(R|¬L) }
i.e., the set of parameters in the conditional probability distributions.
Alarm example
• Given m instances, how do we compute P(A=1)?
  P(A) = (# instances with aj = 1) / m
• How do we compute P(L=1 | A=1)?
  P(L | A) = P(L, A) / P(A) = (# instances with lj = 1 and aj = 1) / (# instances with aj = 1)
• How do we compute P(A=1 | T=1, F=0)?
  P(A | T, ¬F) = P(A, T, ¬F) / P(T, ¬F) = (# instances with aj = 1, tj = 1 and fj = 0) / (# instances with tj = 1 and fj = 0)
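These estimates are just filtered counts. A minimal sketch (instances stored as dicts of 0/1 values; the data and helper are illustrative, not from the slides):

    # MLE for a CPD entry: among instances matching the parent
    # assignment, take the fraction where the child variable is 1.
    data = [
        {"T": 0, "F": 0, "S": 0, "A": 0, "L": 0, "R": 0},
        {"T": 0, "F": 1, "S": 1, "A": 1, "L": 1, "R": 0},
        # ...
    ]

    def cond_prob(data, child, parents):
        matching = [x for x in data
                    if all(x[v] == val for v, val in parents.items())]
        if not matching:
            return None  # 0/0: no data for this parent configuration
        return sum(x[child] for x in matching) / len(matching)

    p_l_given_a = cond_prob(data, "L", {"A": 1})               # 1.0 here
    p_a_given_t_notf = cond_prob(data, "A", {"T": 1, "F": 0})  # None here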
Parameter estimation for general Bayes nets
• Generalizing, for any Bayes net with variables X1, …, Xn:
L(θ | D) = ∏j=1…m P( X1(j), …, Xn(j) | θ )                  (instances are i.i.d.)
         = ∏j=1…m ∏i=1…n P( Xi(j) | Parents(Xi(j)), θ )     (Bayes net factorization)
         = ∏i=1…n ∏j=1…m P( Xi(j) | Parents(Xi(j)), θi )    (simplification: node i only involves its own parameters θi)
         = ∏i=1…n L(θi | D)
• The likelihood function decomposes according to the structure of the network, which creates independent estimation problems.
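Concretely, the decomposition means each node's CPD can be fit by its own independent count over the data. A sketch (structure encoded as a parent map; names and data are illustrative):

    from collections import Counter

    parents = {"T": [], "F": [], "S": ["F"], "A": ["T", "F"],
               "L": ["A"], "R": ["L"]}
    data = [
        {"T": 0, "F": 0, "S": 0, "A": 0, "L": 0, "R": 0},
        {"T": 0, "F": 1, "S": 1, "A": 1, "L": 1, "R": 0},
    ]

    def fit_cpds(data, parents):
        """Estimate P(X = 1 | parent configuration) for every node."""
        cpds = {}
        for x, pa in parents.items():
            ones, totals = Counter(), Counter()
            for inst in data:
                key = tuple(inst[p] for p in pa)  # parent configuration
                totals[key] += 1
                ones[key] += inst[x]
            cpds[x] = {k: ones[k] / totals[k] for k in totals}
        return cpds

    cpds = fit_cpds(data, parents)   # e.g. cpds["A"][(0, 1)] == 1.0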
Coin tossing revisited
• Suppose you observed 3 coin tosses, and all came up tails.
• What is the maximum likelihood estimate for θ?
  θMLE = 0/3 = 0, i.e., the model says heads can never occur!
• Is this a good prediction?
A problem: Zero probabilities
• For problems with lots of variables, it is possible that not all possible values are seen in the data.
• Especially for very rare events.
• What is the MLE for the corresponding parameters?
E.g., Prob(Heads) after seeing only tails in 3 tosses.
• If a value is not seen, the corresponding MLE value for its parameters is 0.
Laplace smoothing
• Instead of: θ = N(H) / ( N(H) + N(T) )
• Use: θ = ( N(H) + 1 ) / ( N(H) + N(T) + 2 )    (+1 for heads, +1 for tails)
• Imagine that you have seen at least 1 instance of each type.
• If you have no data, this estimate is: θ = 0.5
• With 3 tails, the estimate is: θ = 0.2
• With 98 tails, the estimate is: θ = 0.01
• If the distribution is not Bernoulli, the “+2” changes: for a categorical with k possible outcomes, add 1 to each outcome’s count and k to the denominator.
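A sketch of the smoothed estimator, reproducing the numbers above (the helper name is illustrative; k = 2 recovers the Bernoulli case):

    def laplace_estimate(count_v, total, k=2):
        # Pretend each of the k outcomes was seen once before the data.
        return (count_v + 1) / (total + k)

    print(laplace_estimate(0, 0))    # no data           -> 0.5
    print(laplace_estimate(0, 3))    # 0 heads, 3 tails  -> 0.2
    print(laplace_estimate(0, 98))   # 0 heads, 98 tails -> 0.01

The same helper reproduces the smoothed answers on the exercise slide below, e.g. P(C) = laplace_estimate(3, 5) = 4/7.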
What is the MLE of this Bayes net? The estimates after Laplace smoothing?
(The slide shows the classic Cloudy → {Sprinkler, Rain} → WetGrass network together with a dataset of 5 instances; the figure is not reproduced in the extracted text.)
Answers: MLE
P(C) = 3/5
P(S|C) = 2/3      P(S|~C) = 1/2
P(R|C) = 1/3      P(R|~C) = 0/2
P(W|S,R) = 1/1    P(W|S,~R) = 2/2    P(W|~S,R) = 0/0    P(W|~S,~R) = 0/2
Note that P(W|~S,R) = 0/0 is undefined: no instance has S=0, R=1, which is exactly the kind of gap smoothing addresses.
Answers: Laplace Smoothing
P(C) = 4/7
P(S|C) = 3/5      P(S|~C) = 2/4
P(R|C) = 2/5      P(R|~C) = 1/4
P(W|S,R) = 2/3    P(W|S,~R) = 3/4    P(W|~S,R) = 1/2    P(W|~S,~R) = 1/4
An Intuitive Analogy
• A Bayes net structure is like a machine with many dials; the dials correspond to its parameters.
• We know that this machine was used to generate the observed samples, but we don’t know what the dial settings were. Learning means figuring out those settings.
MLE vs Laplace
• MLE and Laplace smoothing are two alternative criteria for selecting the dial settings:
  • MLE: pick θ to maximize P(D | θ);
  • Laplace: also care about generalization beyond the observed data.
Summary of learning in Bayes Nets
• Learning is the process of acquiring a model from a data set.
• In Bayes nets, we can learn the parameters of the network (the numbers in the CPDs) or its structure.
• The maximum likelihood principle says that parameters should make the observed data as likely as possible.
• Parameters can be found by taking the gradient of the likelihood function and setting it to 0.
• In the case of simple distributions, the solution is to use “empirical probabilities”, based on counting the data.
Summary of parameter estimation
• i.i.d. assumption
• Sufficient statistics
• Computing the maximum likelihood
• Laplace smoothing
• Extension to standard probability distributions (see textbook):
• e.g., categorical, Gaussian, Poisson, exponential