
COMP2610 / COMP6261 – Information Theory
Lecture 5: Bernoulli, Binomial, Maximum Likelihood and MAP

Robert Williamson

Research School of Computer Science

6 August 2018

1 / 36

Last time

Examples of application of Bayes’ rule
  - Formalizing problems in the language of probability
  - Eating hamburgers, detecting terrorists, …

Frequentist vs Bayesian probabilities

2 / 36

The Bayesian Inference Framework

Bayesian Inference
Bayesian inference provides us with a mathematical framework explaining
how to change our (prior) beliefs in the light of new evidence.

p(Z|X) = p(X|Z) p(Z) / p(X) = p(X|Z) p(Z) / ∑_{Z′} p(X|Z′) p(Z′)

where p(Z|X) is the posterior, p(X|Z) the likelihood, p(Z) the prior, and
p(X) the evidence.

Prior: Belief that someone is sick

Likelihood: Probability of testing positive given someone is sick

Posterior: Probability of being sick given someone tests positive
3 / 36

This time

The Bernoulli and binomial distributions (we will make much use of these
henceforth in studying binary channels)

Estimating probabilities from data

Bayesian inference for parameter estimation

4 / 36

Outline

1 The Bernoulli Distribution

2 The Binomial Distribution

3 Parameter Estimation

4 Bayesian Parameter Estimation

5 Wrapping up

5 / 36

1 The Bernoulli Distribution

2 The Binomial Distribution

3 Parameter Estimation

4 Bayesian Parameter Estimation

5 Wrapping up

6 / 36

The Bernoulli Distribution
Introduction

Consider a binary variable X ∈ {0, 1}. It could represent many things:
Whether a coin lands heads or tails

The presence/absence of a word in a document

A transmitted bit in a message

The success of a medical trial

Often, these outcomes (0 or 1) are not equally likely

What is a general way to model such an X?

7 / 36

The Bernoulli Distribution
Definition

The variable X takes on the outcomes

X = 1 with probability θ,
X = 0 with probability 1 − θ.

Here, 0 ≤ θ ≤ 1 is a parameter representing the probability of success

For higher values of θ, it is more likely to see 1 than 0

e.g. a biased coin

8 / 36

The Bernoulli Distribution
Definition

By definition,

p(X = 1|θ) = θ
p(X = 0|θ) = 1 − θ

More succinctly,

p(X = x|θ) = θ^x (1 − θ)^(1−x)

This is known as a Bernoulli distribution over binary outcomes:

p(X = x|θ) = Bern(x|θ) = θ^x (1 − θ)^(1−x)

Note the use of the conditioning symbol for θ; will revisit later
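
As an illustration only (not part of the slides), a minimal Python sketch of this pmf; the name bern_pmf and the value of θ used below are hypothetical.

```python
# Minimal sketch of the Bernoulli pmf p(X = x | theta) = theta^x * (1 - theta)^(1 - x).
def bern_pmf(x: int, theta: float) -> float:
    """Probability of outcome x in {0, 1} under Bern(theta)."""
    return theta**x * (1 - theta)**(1 - x)

print(bern_pmf(1, 0.3))  # 0.3  (probability of success)
print(bern_pmf(0, 0.3))  # 0.7  (probability of failure)
```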

9 / 36


The Bernoulli Distribution
Mean and Variance

The expected value (or mean) is given by:

E[X|θ] = ∑_{x∈{0,1}} x · p(x|θ)
       = 1 · p(X = 1|θ) + 0 · p(X = 0|θ)
       = θ.

The variance (or squared standard deviation) is given by:

V[X|θ] = E[(X − E[X])²]
       = E[(X − θ)²]
       = (0 − θ)² · p(X = 0|θ) + (1 − θ)² · p(X = 1|θ)
       = θ(1 − θ).
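
As a rough sanity check (illustrative only, not from the slides), we can simulate Bernoulli draws and compare the sample mean and variance with θ and θ(1 − θ); the seed, sample size and value of θ below are arbitrary choices.

```python
# Illustrative check: the sample mean and variance of Bernoulli draws
# should be close to theta and theta * (1 - theta).
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
theta = 0.3
x = rng.binomial(n=1, p=theta, size=100_000)   # Bernoulli = Binomial with n = 1
print(x.mean())   # approx. 0.30 = theta
print(x.var())    # approx. 0.21 = theta * (1 - theta)
```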

10 / 36

Example: Binary Symmetric Channel

Suppose a sender transmits messages s that are sequences of bits

The receiver sees the bit sequence (message) r

Due to noise in the channel, each transmitted bit is flipped with probability
0 ≤ f ≤ 1

[Channel diagram: inputs s ∈ {0, 1} on the left, outputs r ∈ {0, 1} on the
right; each bit passes through unchanged with probability 1 − f and is
flipped with probability f.]

11 / 36

Example: Binary Symmetric Channel

We can think of r as the outcome of a random variable, with conditional
distribution given by:

p(r = 0|s = 0) = 1 − f        p(r = 0|s = 1) = f
p(r = 1|s = 0) = f            p(r = 1|s = 1) = 1 − f

If E denotes whether an error occurred, clearly

p(E = e) = Bern(e|f ), e ∈ {0, 1}.
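
A small simulation sketch (not from the slides) of this channel: each bit is flipped independently with an assumed probability f, so the error indicator on each bit is Bern(f). The seed and the value f = 0.1 are arbitrary.

```python
# Illustrative binary symmetric channel: each transmitted bit s_i is flipped
# independently with probability f, i.e. the error indicator E_i ~ Bern(f).
import numpy as np

rng = np.random.default_rng(1)                # arbitrary seed
f = 0.1                                       # assumed flip probability
s = rng.integers(0, 2, size=20)               # transmitted bits
e = rng.binomial(n=1, p=f, size=s.size)       # error indicators E_i ~ Bern(f)
r = s ^ e                                     # received bits: flipped where E_i = 1
print("sent:    ", s)
print("received:", r)
print("errors:  ", int(e.sum()))
```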

12 / 36

1 The Bernoulli Distribution

2 The Binomial Distribution

3 Parameter Estimation

4 Bayesian Parameter Estimation

5 Wrapping up

13 / 36

The Binomial Distribution
Introduction

Suppose we perform N independent Bernoulli trials

e.g. we toss a coin N times

e.g. we transmit a sequence of N bits across a noisy channel

Each trial has probability θ of success

What is the distribution of the number of times (m) that X = 1?

e.g. the number of times we obtained m heads

e.g. the number of errors in the transmitted sequence

14 / 36

The Binomial Distribution
Definition

Let

Y = ∑_{i=1}^{N} X_i

where X_i ∼ Bern(θ).

Then Y has a binomial distribution with parameters N, θ:

p(Y = m) = Bin(m|N, θ) = (N choose m) θ^m (1 − θ)^(N−m)

for m ∈ {0, 1, . . . , N}. Here

(N choose m) = N! / ((N − m)! m!)

is the number of ways we can obtain m heads out of N coin flips
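
Illustrative only (not from the slides): the same formula in Python, using math.comb for the binomial coefficient; binom_pmf is a hypothetical helper name.

```python
# Sketch of the binomial pmf Bin(m | N, theta) = C(N, m) * theta^m * (1 - theta)^(N - m).
from math import comb

def binom_pmf(m: int, N: int, theta: float) -> float:
    """Probability of m successes in N independent Bern(theta) trials."""
    return comb(N, m) * theta**m * (1 - theta)**(N - m)

print(binom_pmf(3, 10, 0.25))                            # approx. 0.2503
print(sum(binom_pmf(m, 10, 0.25) for m in range(11)))    # 1.0 (pmf sums to one)
```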
15 / 36

The Binomial Distribution:
Mean and Variance

It is easy to show that:

E[Y] = ∑_{m=0}^{N} m · Bin(m|N, θ) = Nθ

V[Y] = ∑_{m=0}^{N} (m − E[Y])² · Bin(m|N, θ) = Nθ(1 − θ)

This follows from linearity of the mean, and from the fact that the variance
of a sum of independent variables is the sum of their variances:

E[Y] = E[∑_{i=1}^{N} X_i] = ∑_{i=1}^{N} E[X_i] = Nθ

V[Y] = V[∑_{i=1}^{N} X_i] = ∑_{i=1}^{N} V[X_i] = Nθ(1 − θ)

16 / 36

The Binomial Distribution:
Example

Ashton is an excellent off spinner. The probability of him getting a wicket
during a cricket match is 1/4. (That is, on each attempt, there is a 1/4
chance he will get a wicket.)

His coach commands him to make 10 attempts at wickets in a particular
game.

1 What is the probability that he will get exactly three wickets?
Bin(3|10, 0.25)

2 What is the expected number of wickets he will get?
E[Y], where Y ∼ Bin(·|10, 0.25).

3 What is the probability that he will get at least one wicket?
∑_{m=1}^{10} Bin(m|N = 10, θ = 0.25) = 1 − Bin(0|N = 10, θ = 0.25)

(These three answers are checked numerically in the sketch below.)
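
Illustrative check only, with N = 10 attempts and θ = 0.25 as in the example; binom_pmf is the hypothetical helper from the earlier sketch, redefined here so the snippet runs on its own.

```python
# Numerical check of the three answers.
from math import comb

def binom_pmf(m, N, theta):
    return comb(N, m) * theta**m * (1 - theta)**(N - m)

N, theta = 10, 0.25
print(binom_pmf(3, N, theta))        # 1. P(exactly 3 wickets)  approx. 0.250
print(N * theta)                     # 2. E[Y] = N * theta      = 2.5
print(1 - binom_pmf(0, N, theta))    # 3. P(at least 1 wicket)  approx. 0.944
```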

17 / 36

The Binomial Distribution:
Example: Distribution of the Number of Wickets

Histogram of the binomial distribution with N = 10 and θ = 0.25. From Bishop (PRML, 2006)

18 / 36

1 The Bernoulli Distribution

2 The Binomial Distribution

3 Parameter Estimation

4 Bayesian Parameter Estimation

5 Wrapping up

19 / 36

The Bernoulli Distribution: Parameter Estimation

Consider the set of observations D = {x_1, . . . , x_N} with x_i ∈ {0, 1}:
The outcomes of a sequence of coin flips

Whether or not there are errors in a transmitted bit string

Each observation is the outcome of a random variable X , with distribution

p(X = x) = Bern(x|θ) = θ^x (1 − θ)^(1−x)

for some parameter θ

20 / 36

The Bernoulli Distribution: Parameter Estimation

We know that

X ∼ Bern(x|θ) = θ^x (1 − θ)^(1−x)

But often, we don’t know what the value of θ is

The probability of a coin toss resulting in heads

The probability of the word defence appearing in a document about
sports

What would be a reasonable estimate for θ from D?

21 / 36

The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood

Say that we observe

D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}

Intuitively, which seems more plausible: θ = 1/2 or θ = 1/5?

22 / 36

The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood

Say that we observe

D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}

If it were true that θ = 1/2, then the probability of this sequence would be

p(D|θ) = ∏_{i=1}^{10} p(x_i|θ) = ∏_{i=1}^{10} 1/2 = 1/2^10 ≈ 0.001.

23 / 36

The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood

Say that we observe

D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}

If it were true that θ = 1/5, then the probability of this sequence would be

p(D|θ) = ∏_{i=1}^{10} p(x_i|θ) = (1/5)² · (4/5)^8 ≈ 0.007.
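
Illustrative only: computing p(D|θ) for the observed sequence under both candidate values of θ; the helper name likelihood is hypothetical.

```python
# Compare p(D | theta) for theta = 1/2 and theta = 1/5 on the observed sequence.
D = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

def likelihood(data, theta):
    p = 1.0
    for x in data:
        p *= theta**x * (1 - theta)**(1 - x)   # Bernoulli pmf, multiplied over trials
    return p

print(likelihood(D, 1/2))   # approx. 0.00098 = 1 / 2**10
print(likelihood(D, 1/5))   # approx. 0.0067  = (1/5)**2 * (4/5)**8
```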

24 / 36

The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood

We can write down how likely D is under the Bernoulli model. Assuming
independent observations:

p(D|θ) = ∏_{i=1}^{N} p(x_i|θ) = ∏_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i}

We call L(θ) = p(D|θ) the likelihood function

Maximum likelihood principle: We want to maximize this function wrt θ

The parameter for which the observed sequence has the highest
probability

25 / 36


The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood

Maximising p(D|θ) is equivalent to maximising the log-likelihood
log L(θ) = log p(D|θ):

log L(θ) = ∑_{i=1}^{N} log p(x_i|θ) = ∑_{i=1}^{N} [x_i log θ + (1 − x_i) log(1 − θ)]

Setting d log L(θ)/dθ = 0 we obtain:

θ_ML = (1/N) ∑_{i=1}^{N} x_i

The proportion of times x = 1 in the dataset D!
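
In code (illustrative only), the maximum likelihood estimate is simply the sample proportion of 1s, shown here on the sequence from the earlier slides.

```python
# Maximum likelihood estimate of the Bernoulli parameter: the fraction of 1s in D.
D = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
theta_ml = sum(D) / len(D)
print(theta_ml)   # 0.2, i.e. theta_ML = 1/5 for the sequence above
```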

26 / 36


The Bernoulli Distribution:
Parameter Estimation — Issues with Maximum Likelihood

Consider the following scenarios:
After N = 3 coin flips we obtained 3 ‘tails’
  - What is the estimate of the probability of a coin flip resulting in ‘heads’?

In a small set of documents about sports, the word defence never appeared.
  - What are the consequences when predicting whether a document is about
    sports (using Bayes’ rule)?

These issues are usually referred to as overfitting

We need to “smooth” out our parameter estimates

Alternatively, we can do Bayesian inference by considering priors over
the parameters

27 / 36


1 The Bernoulli Distribution

2 The Binomial Distribution

3 Parameter Estimation

4 Bayesian Parameter Estimation

5 Wrapping up

28 / 36

The Bernoulli Distribution:
Parameter Estimation: Bayesian Inference

Recall:

p(θ|X) = p(X|θ) p(θ) / p(X)

where p(θ|X) is the posterior, p(X|θ) the likelihood, p(θ) the prior, and
p(X) the evidence.

If we treat θ as a random variable, we may have some prior belief p(θ)
about its value

e.g. we believe θ is probably close to 0.5

Our prior on θ quantifies what we believe θ is likely to be, before looking at
the data

Our posterior on θ quantifies what we believe θ is likely to be, after looking
at the data

29 / 36

The Bernoulli Distribution:
Parameter Estimation: Bayesian Inference

The likelihood of X given θ is

Bern(x|θ) = θ^x (1 − θ)^(1−x)

For the prior, it is mathematically convenient to express it as a Beta
distribution:

Beta(θ|a, b) = (1 / Z(a, b)) θ^(a−1) (1 − θ)^(b−1),

where Z(a, b) is a suitable normaliser

We can tune a, b to reflect our belief in the range of likely values of θ
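
Illustrative only: evaluating the Beta density with scipy.stats.beta for the (a, b) settings plotted on the next slide; the evaluation points are arbitrary.

```python
# Evaluate the Beta(theta | a, b) density at a few points, for several (a, b) settings.
from scipy.stats import beta

for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    print(a, b, [round(beta.pdf(t, a, b), 3) for t in (0.1, 0.5, 0.9)])
```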

30 / 36

Beta Prior
Examples

[Figure: plots of the Beta prior density p(θ) over θ ∈ [0, 1] for four
parameter settings: a = 0.1, b = 0.1; a = 1, b = 1; a = 2, b = 3; a = 8, b = 4.]
31 / 36

Beta Prior and Binomial Likelihood:
Beta Posterior Distribution

Recall that for D = {x_1, . . . , x_N}, the likelihood under a Bernoulli model is:

p(D|θ) = θ^m (1 − θ)^ℓ,

where m = #(x = 1) and ℓ := N − m = #(x = 0).

For the prior p(θ|a, b) = Beta(θ|a, b) we can obtain the posterior:

p(θ|D, a, b) = p(D|θ) p(θ|a, b) / p(D|a, b)
             = p(D|θ) p(θ|a, b) / ∫₀¹ p(D|θ) p(θ|a, b) dθ
             = Beta(θ|m + a, ℓ + b).

Can use this as our new prior if we see more data!
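
Illustrative only: with a conjugate Beta prior, the Bayesian update reduces to adding the observed counts to the prior parameters; the prior values a = b = 2 below are an arbitrary choice, not from the slides.

```python
# Conjugate update: Beta(a, b) prior + m ones and ell zeros -> Beta(m + a, ell + b) posterior.
D = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
a, b = 2, 2                  # hypothetical prior pseudo-counts
m = sum(D)                   # number of 1s observed
ell = len(D) - m             # number of 0s observed
print(m + a, ell + b)        # posterior is Beta(4, 10) for this data and prior
```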

32 / 36


Beta Prior and Binomial Likelihood:
Beta Posterior Distribution

Now suppose we choose θ_MAP to maximise p(θ|D)
(MAP = Maximum A Posteriori)

One can show that

θ_MAP = (m + a − 1) / (N + a + b − 2)

cf. the estimate that did not use any prior,

θ_ML = m / N

The prior parameters a and b can be seen as adding some “fake” trials!

What values of a and b ensure θ_MAP = θ_ML? a = b = 1. Does this make sense?
(Note that the choice of the Beta distribution was not accidental here: it is
the “conjugate prior” for the binomial distribution.)
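
Illustrative only: comparing θ_MAP and θ_ML on the sequence from earlier, with the same hypothetical Beta(2, 2) prior used in the previous sketch.

```python
# MAP vs ML estimates for the Bernoulli parameter under a Beta(a, b) prior.
D = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
a, b = 2, 2                                  # hypothetical prior
m, N = sum(D), len(D)

theta_ml = m / N                             # 0.20
theta_map = (m + a - 1) / (N + a + b - 2)    # 0.25, pulled toward 0.5 by the prior
print(theta_ml, theta_map)
```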

33 / 36

1 The Bernoulli Distribution

2 The Binomial Distribution

3 Parameter Estimation

4 Bayesian Parameter Estimation

5 Wrapping up

34 / 36

Summary

Distributions involving binary random variables
  - Bernoulli distribution
  - Binomial distribution

Bayesian inference: full posterior on the parameters
  - Beta prior and binomial likelihood → Beta posterior

Reading: MacKay §23.1 and §23.5; Bishop §2.1 and §2.2

35 / 36

Next time

Entropy

36 / 36
