
L17 – Reasoning with Continuous Variables

EECS 391
Intro to AI

Reasoning with Continuous Variables

L17 Tue Nov 6

How do you model a world? 


How do you reason about it?

Bayesian inference for continuous variables

• The simplest case is true or false propositions
• Can easily extend to categorical variables
• The probability calculus is the same for continuous variables

An example with distributions: coin flipping

• In Bernoulli trials, each sample is either 1 (e.g. heads) with probability θ, or 0 (tails) with
probability 1 − θ.

• The binomial distribution specifies the probability of the total # of heads, y, out of n
trials:

p(y \mid \theta, n) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}

[Figure: binomial pmf p(y | θ = 0.5, n = 10) plotted against y]
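As a quick sanity check on this formula (not part of the original slides), here is a minimal Python sketch that evaluates the binomial pmf for the plotted case θ = 0.5, n = 10; the helper name binomial_pmf is my own.

```python
from math import comb

def binomial_pmf(y, n, theta):
    """Probability of exactly y heads in n Bernoulli trials with P(heads) = theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# The plotted case: theta = 0.5, n = 10; the probabilities peak at y = 5 (~0.246).
for y in range(11):
    print(y, round(binomial_pmf(y, 10, 0.5), 4))
```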

The binomial distribution

• In Bernoulli trials, each sample is either 1 (e.g. heads) with probability θ, or 0 (tails) with
probability 1 − θ.

• The binomial distribution specifies the probability of the total # of heads, y, out of n
trials:

p(y \mid \theta, n) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}

[Figures: binomial pmfs p(y | θ = 0.25, n = 10) and p(y | θ = 0.25, n = 20) plotted against y]

The binomial distribution

• In Bernoulli trials, each sample is either 1 (e.g. heads) with probability θ, or 0 (tails) with
probability 1 − θ.

• The binomial distribution specifies the probability of the total # of heads, y, out of n
trials:

p(y \mid \theta, n) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}

How do we determine θ
from a set of trials?

Applying Bayes’ rule

• Given n trials with y heads, what do we know about θ?
• We can apply Bayes’ rule to see how our knowledge changes as we acquire new observations:

p(\theta \mid y, n) = \frac{p(y \mid \theta, n)\, p(\theta \mid n)}{p(y \mid n)},
\qquad p(y \mid n) = \int p(y \mid \theta, n)\, p(\theta \mid n)\, d\theta

(posterior = likelihood × prior / normalizing constant)

● We know the likelihood; what about the prior?
● Uniform on [0, 1] is a reasonable assumption, i.e. “we don’t know anything”.
● In this case, the posterior is just proportional to the likelihood:

p(\theta \mid y, n) \propto \binom{n}{y} \theta^y (1 - \theta)^{n-y}

● What is the form of the posterior?
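To make the proportionality concrete, here is a small sketch of my own (assuming the uniform prior stated above) that evaluates the unnormalized posterior on a grid of θ values and normalizes it numerically.

```python
import numpy as np
from math import comb

def unnormalized_posterior(theta, y, n):
    # With a uniform prior, the posterior is proportional to the binomial likelihood.
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

theta = np.linspace(0.0, 1.0, 1001)
post = unnormalized_posterior(theta, y=1, n=5)

# Normalize numerically so the curve integrates to 1 over [0, 1].
post /= post.sum() * (theta[1] - theta[0])
print(theta[np.argmax(post)])   # peak at theta = y/n = 0.2
```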

Bayesian inference with continuous variables

p(\theta \mid y, n) = \frac{p(y \mid \theta, n)\, p(\theta \mid n)}{p(y \mid n)},
\qquad p(y \mid n) = \int p(y \mid \theta, n)\, p(\theta \mid n)\, d\theta

(posterior = likelihood × prior / normalizing constant)

[Figure: prior (uniform) p(θ | y = 0, n = 0); likelihoods (binomial) p(y | θ, n = 5) for θ = 0.05, 0.2, 0.35, 0.5; posterior (beta) p(θ | y = 1, n = 5)]

p(\theta \mid y, n) \propto \binom{n}{y} \theta^y (1 - \theta)^{n-y}

Updating our knowledge with new information

• Now we can evaluate the posterior just by plugging in different values of y and n:

p(\theta \mid y, n) \propto \binom{n}{y} \theta^y (1 - \theta)^{n-y}

● Check: What goes on the axes?

Evaluating the posterior

[Figure: p(θ | y = 0, n = 0) plotted against θ]

● What do we know initially, before observing any trials?

Coin tossing

[Figure: p(θ | y = 0, n = 1) plotted against θ]

● What is our belief about θ after observing one “tail”? How would you bet?
Is p(θ > 0.5) less than or greater than 0.5?

What about p(θ > 0.3)?

Coin tossing

[Figure: p(θ | y = 1, n = 2) plotted against θ]

● Now after two trials we observe 1 head and 1 tail.

Coin tossing

[Figure: p(θ | y = 1, n = 3) plotted against θ]

● 3 trials: 1 head and 2 tails.

Coin tossing

[Figure: p(θ | y = 1, n = 4) plotted against θ]

● 4 trials: 1 head and 3 tails.

Coin tossing

[Figure: p(θ | y = 1, n = 5) plotted against θ]

● 5 trials: 1 head and 4 tails. Do we have good evidence that this coin is biased?
How would you quantify this statement?

p(\theta > 0.5) = \int_{0.5}^{1} p(\theta \mid y, n)\, d\theta

Can we substitute the expression above?

No! It’s not normalized.

Evaluating the normalizing constant

• To get proper probability density functions, we need to evaluate p(y|n):

p(\theta \mid y, n) = \frac{p(y \mid \theta, n)\, p(\theta \mid n)}{p(y \mid n)}

● Bayes in his original paper in 1763 showed that:

p(y \mid n) = \int_0^1 p(y \mid \theta, n)\, p(\theta \mid n)\, d\theta = \frac{1}{n + 1}

\Rightarrow\quad p(\theta \mid y, n) = (n + 1) \binom{n}{y} \theta^y (1 - \theta)^{n-y}
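The sketch below (mine, not from the slides) checks the 1/(n + 1) result numerically with scipy.integrate.quad and then uses the normalized posterior to answer the earlier question about p(θ > 0.5) for 1 head in 5 trials.

```python
from math import comb
from scipy.integrate import quad

def likelihood(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

y, n = 1, 5

# Bayes' result: the likelihood integrates to 1/(n + 1) under a uniform prior.
norm, _ = quad(likelihood, 0, 1, args=(y, n))
print(norm, 1 / (n + 1))                       # both ~0.1667

# Normalized posterior, then the probability that the coin favors heads.
posterior = lambda t: (n + 1) * likelihood(t, y, n)
p_gt_half, _ = quad(posterior, 0.5, 1.0)
print(p_gt_half)                               # ~0.109, so a bias toward tails looks likely
```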

More coin tossing

• After 50 trials: 17 heads and 33 tails.

[Figure: p(θ | y = 17, n = 50) plotted against θ]

What’s a good estimate of θ?

● There are many possibilities.

A ratio estimate

[Figure: p(θ | y = 17, n = 50) plotted against θ]

● Intuitive estimate: just take the ratio y/n = 17/50 = 0.34.

Estimates for parameter values

• maximum likelihood estimate (MLE)
– derive by taking derivative of likelihood, setting result to zero, and solving 



– ignores prior (or assumes uniform prior)
– (derived on board) 


• Maximum a posteriori (MAP)
– derive by taking derivative of posterior, setting result to zero, and solving

\frac{\partial L}{\partial \theta} = \frac{\partial}{\partial \theta}\left[\binom{n}{y} \theta^y (1 - \theta)^{n-y}\right] = 0
\quad\Rightarrow\quad \theta_{ML} = \frac{y}{n}
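As a quick numerical check of θ_ML = y/n (my own sketch, not from the lecture), maximizing the binomial likelihood over a grid of θ values for the 17-heads-in-50-trials example should land at 0.34.

```python
import numpy as np
from math import comb

y, n = 17, 50
theta = np.linspace(0.001, 0.999, 999)          # avoid the endpoints 0 and 1
likelihood = comb(n, y) * theta**y * (1 - theta)**(n - y)

print(theta[np.argmax(likelihood)], y / n)      # both ~0.34
```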

The maximum a posteriori (MAP) estimate

• This just picks the location of the maximum value of the posterior.

[Figure: p(θ | y = 17, n = 50) plotted against θ]
● In this case, the maximum is also at θ = 0.34.

MAP estimate = 0.34

A different case

• What about after just one trial: 0 heads and 1 tail?

[Figure: p(θ | y = 0, n = 1) plotted against θ]
● MAP and ratio estimates would both say θ = y/n = 0.

Does this make sense?

● What would a better estimate be?

[Figure: p(θ | y = 0, n = 1) plotted against θ]

The expected value estimate

• We defined the expected value of a pdf in the previous lecture:

[Figure: p(θ | y = 0, n = 1) plotted against θ]

E(\theta \mid y, n) = \int_0^1 \theta\, p(\theta \mid y, n)\, d\theta = \frac{y + 1}{n + 2}

What happens for zero trials? E(θ | y = 0, n = 0) = 1/2, the mean of the uniform prior.

After one observed tail:

E(\theta \mid y = 0, n = 1) = \frac{1}{3}
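A small check of the (y + 1)/(n + 2) formula (my sketch, reusing the normalized posterior derived earlier): integrating θ against the posterior for y = 0, n = 1 should give 1/3.

```python
from math import comb
from scipy.integrate import quad

def posterior(theta, y, n):
    # Posterior under a uniform prior, using the (n + 1) normalizer from earlier.
    return (n + 1) * comb(n, y) * theta**y * (1 - theta)**(n - y)

y, n = 0, 1
mean, _ = quad(lambda t: t * posterior(t, y, n), 0, 1)
print(mean, (y + 1) / (n + 2))   # both ~0.3333
```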

Much more coin tossing

• After 500 trials: 184 heads and 316 tails.

[Figure: p(θ | y = 184, n = 500) plotted against θ]

What’s your guess of θ?

Much more coin tossing

• After 5000 trials: 1948 heads and 3052 tails.

[Figure: p(θ | y = 1948, n = 5000) plotted against θ]

True value is 0.4.

● The posterior contains the true value. Is this always the case?

NO! Only if the assumptions are correct.

How could our assumptions be wrong?

Laplace’s example: proportion female births

• A total of 241,945 girls and 251,527 boys were born in Paris from 1745-1770.
• Laplace was able to evaluate the following

[Figure: p(θ | y = 241945, n = 493472) plotted over θ ∈ [0.484, 0.498]]
p(\theta > 0.5) = \int_{0.5}^{1} p(\theta \mid y, n)\, d\theta \approx 1.15 \times 10^{-42}

He was “morally certain” that θ < 0.5. But could he have been wrong?

Laplace and the mass of Saturn

• Laplace used “Bayesian” inference to estimate the mass of Saturn and other planets. For Saturn he said: “It is a bet of 11,000 to 1 that the error in this result is not 1/100th of its value.”

Mass of Saturn as a fraction of the mass of the Sun:
Laplace (1815): 1/3512
NASA (2004): 1/3499.1
Relative error: (3512 − 3499.1) / 3499.1 = 0.0037

Laplace is still winning.

Applying Bayes’ rule with an informative prior

• What if we already know something about θ?
• We can still apply Bayes’ rule to see how our knowledge changes as we acquire new observations:

p(\theta \mid y, n) = \frac{p(y \mid \theta, n)\, p(\theta \mid n)}{p(y \mid n)}

● But now the prior becomes important.
● Assume we know biased coins are never below 0.3 or above 0.7.
● To describe this we can use a beta distribution for the prior.

A beta prior

[Figure: p(θ | y = 0, n = 0) plotted against θ]

● In this case, before observing any trials our prior is not uniform: Beta(a = 20, b = 20).

Coin tossing revisited

[Figure: p(θ | y = 0, n = 1) plotted against θ]

● What is our belief about θ after observing one “tail”?
● With a uniform prior it was the decreasing posterior shown earlier. What will it look like with our prior?

Coin tossing with prior knowledge

[Figure: p(θ | y = 0, n = 1) plotted against θ]

● Our belief about θ after observing one “tail” hardly changes.

Coin tossing

[Figure: p(θ | y = 17, n = 50) plotted against θ]

● After 50 trials, it’s much like before.

Coin tossing

[Figure: p(θ | y = 1948, n = 5000) plotted against θ]

● After 5,000 trials, it’s virtually identical to the uniform-prior result. What did we gain?
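To see concretely what the informative prior buys (and costs), here is a sketch of my own that relies on the standard conjugacy result that a Beta(a, b) prior combined with y heads in n trials gives a Beta(a + y, b + n − y) posterior; scipy.stats.beta supplies the distribution.

```python
from scipy.stats import beta

def posterior(a, b, y, n):
    # Conjugate update: Beta(a, b) prior + y heads in n trials -> Beta(a + y, b + n - y).
    return beta(a + y, b + n - y)

for y, n in [(0, 1), (17, 50), (1948, 5000)]:
    uniform = posterior(1, 1, y, n)        # uniform prior is Beta(1, 1)
    informative = posterior(20, 20, y, n)  # the Beta(20, 20) prior from the slides
    print(f"y={y}, n={n}: "
          f"uniform-prior mean {uniform.mean():.3f}, "
          f"Beta(20,20)-prior mean {informative.mean():.3f}")
```

With one observed tail the informative prior keeps the estimate near 0.5, while after 5,000 trials both posteriors give essentially the same answer, which is the point of the final slide.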

He was “morally certain” θ < 0.5. But could he have been wrong? Laplace and the mass of Saturn • Laplace used “Bayesian” inference to estimate the mass of Saturn and other planets. For Saturn he said: It is a bet of 11000 to 1 that the error in this result is not within 1/100th of its value Mass of Saturn as a fraction of the mass of the Sun Laplace (1815) NASA (2004) 3512 3499.1 (3512 - 3499.1) / 3499.1 = 0.0037 Laplace is still wining. Applying Bayes’ rule with an informative prior • What if we already know something about θ? • We can still apply Bayes’ rule to see how our knowledge changes as we acquire new observations: p(θ|y, n) = p(y|θ, n)p(θ|n) p(y|n) ● Assume we know biased coins are never below 0.3 or above 0.7. ● But now the prior becomes important. ● To describe this we can use a beta distribution for the prior. A beta prior 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =0 ) ● In this case, before observing any trials our prior is not uniform: Beta(a=20,b=20) Coin tossing revisited 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) ● What is our belief about θ after observing one “tail” ? ● With a uniform prior it was: What will it look like with our prior? Coin tossing with prior knowledge 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) ● Our belief about θ after observing one “tail” hardly changes. Coin tossing 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 17 , n =5 0) ● After 50 trials, it’s much like before. Coin tossing 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 19 48 , n =5 00 0) ● After 5,000 trials, it’s virtually identical to the uniform prior. What did we gain?