COMP2610 / COMP6261 – Information Theory
Lecture 5: Bernoulli, Binomial, Maximum Likelihood and MAP
Robert Williamson
Research School of Computer Science
6 August 2018
Last time
Examples of application of Bayes’ rule
- Formalizing problems in the language of probability
- Eating hamburgers, detecting terrorists, …
Frequentist vs Bayesian probabilities
The Bayesian Inference Framework
Bayesian Inference
Bayesian inference provides us with a mathematical framework explaining how to update our (prior) beliefs in the light of new evidence.
p(Z | X) = p(X | Z) p(Z) / p(X) = p(X | Z) p(Z) / ∑_{Z'} p(X | Z') p(Z')
(posterior = likelihood × prior / evidence)

Prior: Belief that someone is sick
Likelihood: Probability of testing positive given someone is sick
Posterior: Probability of being sick given someone tests positive
This time
The Bernoulli and binomial distribution (we will make much use of this
henceforth in studying binary channels)
Estimating probabilities from data
Bayesian inference for parameter estimation
Outline
1 The Bernoulli Distribution
2 The Binomial Distribution
3 Parameter Estimation
4 Bayesian Parameter Estimation
5 Wrapping up
The Bernoulli Distribution
Introduction
Consider a binary variable X ∈ {0, 1}. It could represent many things:
Whether a coin lands heads or tails
The presence/absence of a word in a document
A transmitted bit in a message
The success of a medical trial
Often, these outcomes (0 or 1) are not equally likely
What is a general way to model such an X?
The Bernoulli Distribution
Definition
The variable X takes on the outcomes
X = 1 with probability θ
X = 0 with probability 1 − θ
Here, 0 ≤ θ ≤ 1 is a parameter representing the probability of success
For higher values of θ, it is more likely to see 1 than 0
e.g. a biased coin
The Bernoulli Distribution
Definition
By definition,
p(X = 1|θ) = θ
p(X = 0|θ) = 1− θ
More succinctly,
p(X = x | θ) = θ^x (1 − θ)^{1−x}
This is known as a Bernoulli distribution over binary outcomes:
p(X = x | θ) = Bern(x | θ) = θ^x (1 − θ)^{1−x}
Note the use of the conditioning symbol for θ; we will revisit this later
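A one-line check of the pmf above (a sketch in Python; the choice θ = 0.7 is an arbitrary example, not from the slides):

```python
def bern_pmf(x, theta):
    """Bernoulli pmf: p(X = x | theta) = theta**x * (1 - theta)**(1 - x), for x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

# Plugging in x = 1 and x = 0 recovers theta and 1 - theta respectively.
print(bern_pmf(1, 0.7), bern_pmf(0, 0.7))   # 0.7 and 0.3 (up to floating point)
```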
The Bernoulli Distribution
Mean and Variance
The expected value (or mean) is given by:
E[X | θ] = ∑_{x ∈ {0,1}} x · p(x | θ) = 1 · p(X = 1 | θ) + 0 · p(X = 0 | θ) = θ.
The variance (or squared standard deviation) is given by:
V[X | θ] = E[(X − E[X])^2] = E[(X − θ)^2] = (0 − θ)^2 · p(X = 0 | θ) + (1 − θ)^2 · p(X = 1 | θ) = θ(1 − θ).
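A quick simulation check of these two formulas (a sketch with NumPy; θ = 0.3 and the sample size are arbitrary choices):

```python
import numpy as np

theta = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=theta, size=100_000)   # Bernoulli(theta) draws (binomial with N = 1)

print(x.mean())   # ≈ theta = 0.3
print(x.var())    # ≈ theta * (1 - theta) = 0.21
```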
Example: Binary Symmetric Channel
Suppose a sender transmits messages s that are sequences of bits
The receiver sees the bit sequence (message) r
Due to noise in the channel, each transmitted bit is flipped with probability f, where 0 ≤ f ≤ 1
[Channel diagram: input s ∈ {0, 1}, output r ∈ {0, 1}; r = s with probability 1 − f, and r ≠ s with probability f]
Example: Binary Symmetric Channel
We can think of r as the outcome of a random variable, with conditional
distribution given by:
p(r = 0 | s = 0) = 1 − f    p(r = 0 | s = 1) = f
p(r = 1 | s = 0) = f        p(r = 1 | s = 1) = 1 − f
If E denotes whether an error occurred, clearly
p(E = e) = Bern(e|f ), e ∈ {0, 1}.
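A small simulation of this channel (a sketch; the crossover probability f = 0.1 and the message length are made-up values):

```python
import numpy as np

def bsc(s, f, rng):
    """Binary symmetric channel: each bit of s is flipped independently with probability f."""
    flips = rng.random(s.shape) < f          # Bernoulli(f) error indicators
    return s ^ flips.astype(s.dtype)         # XOR applies the flips

rng = np.random.default_rng(1)
f = 0.1
s = rng.integers(0, 2, size=10_000)          # transmitted bits
r = bsc(s, f, rng)                           # received bits
print((r != s).mean())                       # empirical error rate ≈ f
```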
The Binomial Distribution
Introduction
Suppose we perform N independent Bernoulli trials
e.g. we toss a coin N times
e.g. we transmit a sequence of N bits across a noisy channel
Each trial has probability θ of success
What is the distribution of the number of times (m) that X = 1?
e.g. the number m of heads we obtained
e.g. the number of errors in the transmitted sequence
The Binomial Distribution
Definition
Let Y = ∑_{i=1}^{N} X_i, where each X_i ∼ Bern(θ) independently.
Then Y has a binomial distribution with parameters N, θ:
p(Y = m) = Bin(m | N, θ) = \binom{N}{m} θ^m (1 − θ)^{N−m}
for m ∈ {0, 1, . . . , N}. Here
\binom{N}{m} = N! / ((N − m)! m!)
is the number of ways we can obtain m heads out of N coin flips
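As a sanity check, the pmf formula can be evaluated directly or via scipy.stats (a sketch; N, θ and m are arbitrary example values):

```python
from math import comb
from scipy.stats import binom

N, theta, m = 6, 0.3, 2
p_formula = comb(N, m) * theta**m * (1 - theta)**(N - m)   # binomial coefficient times theta^m (1 - theta)^(N - m)
print(p_formula)                  # ≈ 0.324
print(binom.pmf(m, N, theta))     # same value from scipy.stats
```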
The Binomial Distribution:
Mean and Variance
It is easy to show that:
E[Y] = ∑_{m=0}^{N} m · Bin(m | N, θ) = Nθ
V[Y] = ∑_{m=0}^{N} (m − E[Y])^2 · Bin(m | N, θ) = Nθ(1 − θ)
This follows from linearity of expectation and additivity of variance for independent variables:
E[Y] = E[∑_{i=1}^{N} X_i] = ∑_{i=1}^{N} E[X_i] = Nθ
V[Y] = V[∑_{i=1}^{N} X_i] = ∑_{i=1}^{N} V[X_i] = Nθ(1 − θ)
The Binomial Distribution:
Example
Ashton is an excellent off spinner. The probability of him getting a wicket during a cricket match is 1/4. (That is, on each attempt, there is a 1/4 chance he will get a wicket.)
His coach commands him to make 10 attempts at a wicket in a particular game.
1. What is the probability that he will get exactly three wickets?
Bin(3 | 10, 0.25)
2. What is the expected number of wickets he will get?
E[Y], where Y ∼ Bin(· | 10, 0.25).
3. What is the probability that he will get at least one wicket?
∑_{m=1}^{10} Bin(m | N = 10, θ = 0.25) = 1 − Bin(0 | N = 10, θ = 0.25)
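The three answers can be evaluated numerically (a sketch using scipy.stats):

```python
from scipy.stats import binom

N, theta = 10, 0.25
print(binom.pmf(3, N, theta))       # 1. P(exactly three wickets)  ≈ 0.250
print(N * theta)                    # 2. expected number of wickets = 2.5
print(1 - binom.pmf(0, N, theta))   # 3. P(at least one wicket)    ≈ 0.944
```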
The Binomial Distribution:
Example: Distribution of the Number of Wickets
Histogram of the binomial distribution with N = 10 and θ = 0.25. From Bishop (PRML, 2006)
The Bernoulli Distribution: Parameter Estimation
Consider the set of observations D = {x_1, . . . , x_N} with x_i ∈ {0, 1}:
The outcomes of a sequence of coin flips
Whether or not there are errors in a transmitted bit string
Each observation is the outcome of a random variable X , with distribution
p(X = x) = Bern(x | θ) = θ^x (1 − θ)^{1−x}
for some parameter θ
The Bernoulli Distribution: Parameter Estimation
We know that
X ∼ Bern(θ), i.e. p(X = x | θ) = θ^x (1 − θ)^{1−x}
But often, we don’t know what the value of θ is
The probability of a coin toss resulting in heads
The probability of the word defence appearing in a document about
sports
What would be a reasonable estimate for θ from D?
The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood
Say that we observe
D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
Intuitively, which seems more plausible: θ = 1/2? θ = 1/5?
The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood
Say that we observe
D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
If it were true that θ = 1/2, then the probability of this sequence would be
p(D | θ) = ∏_{i=1}^{10} p(x_i | θ) = (1/2)^{10} ≈ 0.001.
The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood
Say that we observe
D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
If it were true that θ = 1/5, then the probability of this sequence would be
p(D | θ) = ∏_{i=1}^{10} p(x_i | θ) = (1/5)^2 · (4/5)^8 ≈ 0.007.
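The two computations above can be reproduced directly (a sketch; likelihood() is a helper name introduced here, not notation from the slides):

```python
import numpy as np

D = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

def likelihood(theta, data):
    """p(D | theta) for i.i.d. Bernoulli observations."""
    return np.prod(theta**data * (1 - theta)**(1 - data))

print(likelihood(0.5, D))   # ≈ 0.00098  (theta = 1/2)
print(likelihood(0.2, D))   # ≈ 0.0067   (theta = 1/5 makes the data more probable)
```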
The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood
We can write down how likely D is under the Bernoulli model. Assuming
independent observations:
p(D | θ) = ∏_{i=1}^{N} p(x_i | θ) = ∏_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i}
We call L(θ) = p(D|θ) the likelihood function
Maximum likelihood principle: We want to maximise this function with respect to θ
The parameter for which the observed sequence has the highest
probability
The Bernoulli Distribution: Parameter Estimation:
Maximum Likelihood
Maximising p(D | θ) is equivalent to maximising the log-likelihood log p(D | θ)
log p(D | θ) = ∑_{i=1}^{N} log p(x_i | θ) = ∑_{i=1}^{N} [x_i log θ + (1 − x_i) log(1 − θ)]
Setting the derivative with respect to θ to zero, we obtain:
θ_ML = (1/N) ∑_{i=1}^{N} x_i
The proportion of times x = 1 in the dataset D!
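A sketch checking the closed-form estimate against a grid search over the log-likelihood (the data are the ten observations from the earlier example):

```python
import numpy as np

D = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

theta_ml = D.mean()                 # closed form: proportion of 1s = 0.2
print(theta_ml)

# Numerical check: the log-likelihood peaks at the same value
thetas = np.linspace(0.01, 0.99, 99)
log_lik = D.sum() * np.log(thetas) + (len(D) - D.sum()) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])   # ≈ 0.2
```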
The Bernoulli Distribution:
Parameter Estimation — Issues with Maximum Likelihood
Consider the following scenarios:
After N = 3 coin flips we obtained 3 ‘tails’
- What is the estimate of the probability of a coin flip resulting in ‘heads’?
In a small set of documents about sports, the word defence never appeared.
- What are the consequences when predicting whether a document is about sports (using Bayes’ rule)?
These issues are usually referred to as overfitting
Need to “smooth” out our parameter estimates
Alternatively, we can do Bayesian inference by considering priors over
the parameters
The Bernoulli Distribution:
Parameter Estimation: Bayesian Inference
Recall:
p(θ | X) = p(X | θ) p(θ) / p(X)
(posterior = likelihood × prior / evidence)
If we treat θ as a random variable, we may have some prior belief p(θ)
about its value
e.g. we believe θ is probably close to 0.5
Our prior on θ quantifies what we believe θ is likely to be, before looking at
the data
Our posterior on θ quantifies what we believe θ is likely to be, after looking
at the data
The Bernoulli Distribution:
Parameter Estimation: Bayesian Inference
The likelihood of X given θ is
Bern(x | θ) = θ^x (1 − θ)^{1−x}
For the prior, it is mathematically convenient to express it as a Beta distribution:
Beta(θ | a, b) = (1 / Z(a, b)) θ^{a−1} (1 − θ)^{b−1},
where Z(a, b) is a suitable normaliser
We can tune a, b to reflect our belief in the range of likely values of θ
Beta Prior
Examples
[Four plots of the Beta density p(θ) for θ ∈ [0, 1]: (a = 0.1, b = 0.1), (a = 1, b = 1), (a = 2, b = 3), (a = 8, b = 4)]
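A short sketch of how a and b shape the prior, using the four (a, b) pairs above (scipy.stats.beta uses the same parameterisation):

```python
from scipy.stats import beta

for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    prior_mean = a / (a + b)   # the mean of a Beta(a, b) distribution
    print(f"a={a}, b={b}: prior mean = {prior_mean:.2f}, density at θ=0.5 = {beta.pdf(0.5, a, b):.2f}")
```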
Beta Prior and Binomial Likelihood:
Beta Posterior Distribution
Recall that for D = {x_1, . . . , x_N}, the likelihood under a Bernoulli model is:
p(D | θ) = θ^m (1 − θ)^ℓ,
where m = #(x = 1) and ℓ := N − m = #(x = 0).
For the prior p(θ | a, b) = Beta(θ | a, b) we can obtain the posterior:
p(θ | D, a, b) = p(D | θ) p(θ | a, b) / p(D | a, b)
             = p(D | θ) p(θ | a, b) / ∫_0^1 p(D | θ) p(θ | a, b) dθ
             = Beta(θ | m + a, ℓ + b).
Can use this as our new prior if we see more data!
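A sketch of this conjugate update for the earlier ten observations (the prior Beta(2, 2) is an illustrative choice, not from the slides):

```python
import numpy as np
from scipy.stats import beta

D = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
m = D.sum()                 # number of 1s: 2
ell = len(D) - m            # number of 0s: 8

a, b = 2, 2                              # illustrative prior hyperparameters
post_a, post_b = m + a, ell + b          # posterior is Beta(theta | m + a, ell + b)
print(post_a, post_b)                    # 4, 10
print(beta.mean(post_a, post_b))         # posterior mean (m + a)/(N + a + b) = 4/14 ≈ 0.286
```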
Beta Prior and Binomial Likelihood:
Beta Posterior Distribution
Now suppose we choose θ_MAP to maximise p(θ|D)
(MAP = Maximum A Posteriori)
One can show that
θ_MAP = (m + a − 1) / (N + a + b − 2)
cf. the estimate that did not use any prior,
θ_ML = m / N
The prior parameters a and b can be seen as adding some “fake” trials!
What values of a and b ensure θ_MAP = θ_ML? a = b = 1. Make sense?
(Note that the choice of the beta distribution was not accidental here — it
is the “conjugate prior” for the binomial distribution. )
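A sketch comparing the two estimators on the same data (again with the illustrative prior Beta(2, 2)):

```python
import numpy as np

D = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
N, m = len(D), D.sum()

theta_ml = m / N                                    # 0.2
a, b = 2, 2                                         # illustrative prior hyperparameters
theta_map = (m + a - 1) / (N + a + b - 2)           # (2 + 1)/(10 + 2) = 0.25, pulled towards the prior mean 0.5
print(theta_ml, theta_map)

# With a = b = 1 (a flat prior), MAP and ML coincide:
a, b = 1, 1
print((m + a - 1) / (N + a + b - 2))                # 0.2
```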
Summary
Distributions involving binary random variables
- Bernoulli distribution
- Binomial distribution
Bayesian inference: Full posterior on the parameters
- Beta prior and binomial likelihood → Beta posterior
Reading: MacKay §23.1 and §23.5; Bishop §2.1 and §2.2
Next time
Entropy