L17 – Reasoning with Continuous Variables
EECS 391, Intro to AI
Tue Nov 6
How do you model a world?
How do you reason about it?
Bayesian inference for continuous variables
• The simplest case is true or false propositions
• Can easily extend to categorical variables
• The probability calculus is the same for continuous variables
An example with distributions: coin flipping
• In Bernoulli trials, each sample is either 1 (e.g. heads) with probability θ, or 0 (tails) with
probability 1 − θ.
• The binomial distribution specifies the probability of the total # of heads, y, out of n
trials:
p(y | θ, n) = \binom{n}{y} θ^y (1 − θ)^{n−y}
[Figure: p(y | θ = 0.5, n = 10) plotted against y]
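As a quick sanity check of the formula above, the values behind the plot can be computed directly. This is a minimal sketch in plain Python (the function name binom_pmf is illustrative, not from the lecture):

```python
from math import comb

def binom_pmf(y, n, theta):
    """p(y | theta, n) = C(n, y) * theta^y * (1 - theta)^(n - y)."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Values behind the plot: theta = 0.5, n = 10
for y in range(11):
    print(y, round(binom_pmf(y, 10, 0.5), 4))

# The pmf sums to 1 over y = 0..n
assert abs(sum(binom_pmf(y, 10, 0.5) for y in range(11)) - 1.0) < 1e-12
```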
The binomial distribution
[Figure: p(y | θ = 0.25, n = 10) plotted against y]
[Figure: p(y | θ = 0.25, n = 20) plotted against y]
The binomial distribution
How do we determine θ from a set of trials?
Applying Bayes’ rule
• Given n trials with y heads, what do we know about θ?
• We can apply Bayes’ rule to see how our knowledge changes as we acquire new
observations:
p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)
(posterior = likelihood × prior / normalizing constant, where the normalizing constant is p(y | n) = ∫ p(y | θ, n) p(θ | n) dθ)
● We know the likelihood, what about the prior?
● Uniform on [0, 1] is a reasonable assumption, i.e. “we don’t know anything”.
● What is the form of the posterior?
● In this case, the posterior is just proportional to the likelihood:
p(θ | y, n) ∝ \binom{n}{y} θ^y (1 − θ)^{n−y}
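To see what this looks like, the unnormalized posterior can be evaluated on a grid of θ values. A minimal sketch (grid size and names are my own choices), using y = 1 head in n = 5 trials as in the posterior figure on the next slide:

```python
from math import comb

def unnorm_posterior(theta, y, n):
    # With a uniform prior, p(theta | y, n) is proportional to the likelihood
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

grid = [i / 100 for i in range(101)]          # theta values from 0 to 1
values = [unnorm_posterior(t, y=1, n=5) for t in grid]

# The curve peaks at theta = y/n = 0.2, but its area is not 1: it is not
# yet a proper density (the normalizing constant is handled later).
peak = grid[values.index(max(values))]
area = sum(values) * 0.01                     # crude Riemann sum
print("peak at theta ≈", peak, "  area ≈", round(area, 3))
```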
Bayesian inference with continuous variables
p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n), with normalizing constant p(y | n) = ∫ p(y | θ, n) p(θ | n) dθ
(posterior = likelihood × prior / normalizing constant)
[Figure: prior (uniform) p(θ | y = 0, n = 0); likelihoods (binomial) p(y | θ, n = 5) for θ = 0.05, 0.2, 0.35, 0.5; posterior (beta) p(θ | y = 1, n = 5)]
p(θ | y, n) ∝ \binom{n}{y} θ^y (1 − θ)^{n−y}
Updating our knowledge with new information
• Now we can evaluate the posterior just by plugging in different values of y and n.
p(θ | y, n) ∝ \binom{n}{y} θ^y (1 − θ)^{n−y}
● Check: What goes on the axes?
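A small sketch of this plugging-in (plain Python, names my own; the (y, n) pairs mirror the coin-tossing slides that follow). θ goes on the x-axis and the posterior density on the y-axis:

```python
from math import comb

def unnorm_posterior(theta, y, n):
    # p(theta | y, n) ∝ C(n, y) * theta^y * (1 - theta)^(n - y), uniform prior
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

grid = [i / 100 for i in range(101)]

# Re-evaluate the posterior as more flips come in (one head among the tails)
for y, n in [(0, 1), (1, 2), (1, 3), (1, 4), (1, 5)]:
    curve = [unnorm_posterior(t, y, n) for t in grid]
    peak = grid[curve.index(max(curve))]
    print(f"y={y}, n={n}: posterior peaks at theta ≈ {peak}")
```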
Evaluating the posterior
[Figure: p(θ | y = 0, n = 0) plotted against θ]
● What do we know initially, before observing any trials?
Coin tossing
[Figure: p(θ | y = 0, n = 1) plotted against θ]
● What is our belief about θ after observing one “tail”? How would you bet?
Is p(θ > 0.5) less than or greater than 0.5?
What about p(θ > 0.3)?
Coin tossing
[Figure: p(θ | y = 1, n = 2) plotted against θ]
● Now after two trials we observe 1 head and 1 tail.
Coin tossing
[Figure: p(θ | y = 1, n = 3) plotted against θ]
● 3 trials: 1 head and 2 tails.
Coin tossing
[Figure: p(θ | y = 1, n = 4) plotted against θ]
● 4 trials: 1 head and 3 tails.
Coin tossing
[Figure: p(θ | y = 1, n = 5) plotted against θ]
● 5 trials: 1 head and 4 tails. Do we have good evidence that this coin is biased?
How would you quantify this statement?
p(θ > 0.5) = ∫_{0.5}^{1.0} p(θ | y, n) dθ
Can we substitute the expression above?
No! It’s not normalized.
Evaluating the normalizing constant
• To get proper probability density functions, we need to evaluate p(y|n):
p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)
● Bayes in his original paper in 1763 showed that:
p(y | n) = ∫_0^1 p(y | θ, n) p(θ | n) dθ = 1 / (n + 1)
⇒ p(θ | y, n) = \binom{n}{y} θ^y (1 − θ)^{n−y} (n + 1)
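A numerical check of both results, and of the tail probability asked about on the previous slide; a minimal sketch using a simple grid approximation (the step size is arbitrary):

```python
from math import comb

def likelihood(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

y, n = 1, 5
dt = 1e-4
grid = [i * dt for i in range(int(1 / dt) + 1)]

# Bayes (1763): the uniform-prior evidence p(y | n) equals 1 / (n + 1)
evidence = sum(likelihood(t, y, n) for t in grid) * dt
print(round(evidence, 4), 1 / (n + 1))                  # both ≈ 0.1667

# With the (n + 1) factor the posterior is properly normalized,
# so p(theta > 0.5 | y=1, n=5) can be read off directly.
p_gt_half = sum(likelihood(t, y, n) * (n + 1) for t in grid if t > 0.5) * dt
print(round(p_gt_half, 3))                              # ≈ 0.109
```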
More coin tossing
• After 50 trials: 17 heads and 33 tails.
[Figure: p(θ | y = 17, n = 50) plotted against θ]
What’s a good estimate of θ?
● There are many possibilities.
A ratio estimate
[Figure: p(θ | y = 17, n = 50) plotted against θ, with y/n = 0.34 marked]
● Intuitive estimate: just take the ratio θ = y/n = 17/50 = 0.34
Estimates for parameter values
• maximum likelihood estimate (MLE)
– derive by taking derivative of likelihood, setting result to zero, and solving
– ignores prior (or assumes uniform prior)
– (derived on board)
• Maximum a posteriori (MAP)
– derive by taking derivative of posterior, setting result to zero, and solving
∂L/∂θ = ∂/∂θ [ \binom{n}{y} θ^y (1 − θ)^{n−y} ] = 0
θ_ML = y/n
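The same result can be checked numerically by maximizing the likelihood over a grid; a small sketch (my own variable names), using the 50-trial example (y = 17, n = 50) from earlier:

```python
from math import comb

def likelihood(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

y, n = 17, 50
grid = [i / 10000 for i in range(10001)]

# The grid maximizer agrees with the closed form theta_ML = y / n
theta_ml = max(grid, key=lambda t: likelihood(t, y, n))
print(theta_ml, y / n)   # both 0.34
```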
The maximum a posteriori (MAP) estimate
• This just picks the location of the maximum of the posterior.
[Figure: p(θ | y = 17, n = 50) plotted against θ, with the MAP estimate 0.34 marked]
● In this case, the maximum is also at θ = 0.34.
A different case
• What about after just one trial: 0 heads and 1 tail?
[Figure: p(θ | y = 0, n = 1) plotted against θ]
● The MAP and ratio estimates (y/n = 0) would both say 0. Does this make sense?
● What would a better estimate be?
The expected value estimate
• We defined the expected value of a pdf in the previous lecture:
E(θ | y, n) = ∫_0^1 θ p(θ | y, n) dθ = (y + 1) / (n + 2)
What happens for zero trials?
E(θ | y = 0, n = 1) = 1/3
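The closed form can be verified with a grid approximation of the integral; a minimal sketch (step size arbitrary), using the one-tail case above:

```python
from math import comb

def posterior(theta, y, n):
    # Normalized posterior under a uniform prior (the (n + 1) factor from Bayes' result)
    return comb(n, y) * theta**y * (1 - theta)**(n - y) * (n + 1)

def posterior_mean(y, n, dt=1e-4):
    grid = [i * dt for i in range(int(1 / dt) + 1)]
    return sum(t * posterior(t, y, n) for t in grid) * dt

# One tail observed: the expected value is 1/3, unlike the MAP / ratio estimate of 0
print(round(posterior_mean(0, 1), 3), (0 + 1) / (1 + 2))
```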
Much more coin tossing
• After 500 trials: 184 heads and 316 tails.
[Figure: p(θ | y = 184, n = 500) plotted against θ]
What’s your guess of θ?
Much more coin tossing
• After 5000 trials: 1948 heads and 3052 tails.
[Figure: p(θ | y = 1948, n = 5000) plotted against θ]
True value is 0.4.
● The posterior contains the true value. Is this always the case?
NO! Only if the assumptions are correct.
How could our assumptions be wrong?
Laplace’s example: proportion female births
• A total of 241,945 girls and 251,527 boys were born in Paris from 1745-1770.
• Laplace was able to evaluate the following
[Figure: p(θ | y = 241945, n = 493472) plotted against θ]
p(θ > 0.5) = ∫_{0.5}^{1.0} p(θ | y, n) dθ ≈ 1.15 × 10^−42
He was “morally certain” that θ < 0.5. But could he have been wrong?
Laplace and the mass of Saturn
• Laplace used “Bayesian” inference to estimate the mass of Saturn and other planets. For Saturn he said:
“It is a bet of 11,000 to 1 that the error in this result is not within 1/100th of its value.”
Mass of Saturn as a fraction of the mass of the Sun:
Laplace (1815): 3512
NASA (2004): 3499.1
(3512 − 3499.1) / 3499.1 = 0.0037
Laplace is still winning.
Applying Bayes’ rule with an informative prior
• What if we already know something about θ?
• We can still apply Bayes’ rule to see how our knowledge changes as we acquire new observations:
p(θ | y, n) = p(y | θ, n) p(θ | n) / p(y | n)
● But now the prior becomes important.
● Assume we know biased coins are never below 0.3 or above 0.7.
● To describe this we can use a beta distribution for the prior.
A beta prior
[Figure: Beta(a = 20, b = 20) prior p(θ | y = 0, n = 0) plotted against θ]
● In this case, before observing any trials our prior is not uniform: Beta(a = 20, b = 20).
Coin tossing revisited
[Figure: p(θ | y = 0, n = 1) plotted against θ]
● What is our belief about θ after observing one “tail”?
● With a uniform prior it was the posterior shown earlier. What will it look like with our prior?
Coin tossing with prior knowledge
[Figure: p(θ | y = 0, n = 1) plotted against θ]
● Our belief about θ after observing one “tail” hardly changes.
Coin tossing
[Figure: p(θ | y = 17, n = 50) plotted against θ]
● After 50 trials, it’s much like before.
Coin tossing
[Figure: p(θ | y = 1948, n = 5000) plotted against θ]
● After 5,000 trials, it’s virtually identical to the uniform-prior result. What did we gain?
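These last slides are exactly the beta–binomial conjugate update: a Beta(a, b) prior combined with y heads in n trials gives a Beta(a + y, b + n − y) posterior. A minimal sketch using scipy.stats (assumed available); the flip counts match the slides, the prior is the Beta(20, 20) used above, and the variable names are my own:

```python
from scipy.stats import beta

a, b = 20, 20                        # informative prior, mass concentrated near 0.5

for y, n in [(0, 1), (17, 50), (1948, 5000)]:
    post = beta(a + y, b + n - y)    # conjugate posterior
    print(f"y={y}, n={n}: posterior mean = {post.mean():.3f}, "
          f"P(theta > 0.5) = {post.sf(0.5):.3f}")

# Laplace's birth-ratio calculation corresponds to a uniform Beta(1, 1) prior;
# the survival function gives the tiny tail probability (Laplace's value was
# about 1.15e-42; the exact printed number depends on scipy's numerics).
girls, total = 241945, 241945 + 251527
print(beta(1 + girls, 1 + total - girls).sf(0.5))
```

With the informative prior, the one-tail posterior barely moves (mean ≈ 0.49), while after 5,000 flips the prior’s influence is negligible, which is the point of the last slide.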
He was “morally certain” θ < 0.5. But could he have been wrong? Laplace and the mass of Saturn • Laplace used “Bayesian” inference to estimate the mass of Saturn and other planets. For Saturn he said: It is a bet of 11000 to 1 that the error in this result is not within 1/100th of its value Mass of Saturn as a fraction of the mass of the Sun Laplace (1815) NASA (2004) 3512 3499.1 (3512 - 3499.1) / 3499.1 = 0.0037 Laplace is still wining. Applying Bayes’ rule with an informative prior • What if we already know something about θ? • We can still apply Bayes’ rule to see how our knowledge changes as we acquire new observations: p(θ|y, n) = p(y|θ, n)p(θ|n) p(y|n) ● Assume we know biased coins are never below 0.3 or above 0.7. ● But now the prior becomes important. ● To describe this we can use a beta distribution for the prior. A beta prior 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =0 ) ● In this case, before observing any trials our prior is not uniform: Beta(a=20,b=20) Coin tossing revisited 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) ● What is our belief about θ after observing one “tail” ? ● With a uniform prior it was: What will it look like with our prior? Coin tossing with prior knowledge 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) ● Our belief about θ after observing one “tail” hardly changes. Coin tossing 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 17 , n =5 0) ● After 50 trials, it’s much like before. Coin tossing 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 19 48 , n =5 00 0) ● After 5,000 trials, it’s virtually identical to the uniform prior. What did we gain?