COMP2610 / COMP6261 – Information Theory – Lecture 6: Entropy
Robert Williamson
Research School of Computer Science
7 August 2018
Last time
The Bernoulli and Binomial distributions
Maximum likelihood estimation
Bayesian parameter estimation
This time
Information content and entropy
Examples and intuition
Some basic properties of entropy
Outline
1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties
2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy
3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions
4 Joint Entropy, Conditional Entropy and Chain Rule
5 An Axiomatic Characterisation
6 Wrapping up
Recap: A General Communication System
How informative is a message?
Information Content: Informally
Say that a message comprises a single bit (one binary random variable)
Whether or not a coin comes up heads
Whether or not my favourite horse wins a race
Informally, the amount of information in such a message is:
How unexpected or “surprising” an outcome of a random variable is
  - If a coin comes up Heads 99.99% of the time, the message “Tails” is much more informative than “Heads”
  - If I believe my favourite horse will win with 99.99% probability, it is surprising to find out it did not
How predictable a random variable is
  - If a coin comes up Heads 99.99% of the time, we can predict the next message as “Heads” and be right most of the time
  - If I believe my favourite horse will win with 99.99% probability, then I believe predicting so to be right most of the time
Information Content: Formally
Intuitively, we measure information of a message in relation to the other
messages we could have seen
For binary messages, we either see 0 or 1
The message 1 is informative when there is a good chance I might
have seen 0
How can we formalise and thus measure information content?
Information content of an outcome must depend on its probability
Information content of a random variable must depend on its
probability distribution
Information Content of an Outcome: Definition
Let X be a discrete r.v. with possible outcomes X
e.g. X = {0, 1}
e.g. X = {Yes, No, Maybe}
Let p(x) denote the probability of outcome x ∈ X
The information content of an outcome x ∈ X is:

    h(x) = log2(1 / p(x))
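As a side note (not part of the original slides), here is a minimal Python sketch of this definition; the helper name info_content and the example probabilities are my own:

```python
import math

def info_content(p: float) -> float:
    """Information content h(x) = log2(1/p(x)) of an outcome with probability p, in bits."""
    return math.log2(1.0 / p)

# A near-certain outcome carries little information; a rare one carries a lot.
print(info_content(0.9999))  # ~0.000144 bits for "Heads" from a 99.99%-Heads coin
print(info_content(0.0001))  # ~13.3 bits for the surprising "Tails"
```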
Information Content of an Outcome: Properties
The information content of x grows as p(x) shrinks
Outcomes that are rare are deemed to contain more information
Choice of logarithm basis is arbitrary
If we use log2 we measure information in bits
What about other functions of p(x), e.g. 1/p(x)² − 1?
Entropy of a Random Variable: Definition
Let X be a discrete r.v. with possible outcomes X .
The entropy of the random variable X is the average information content of
the outcomes:
    H(X) = E_x[h(x)]
         = ∑_x p(x) · log2(1 / p(x))
         = −∑_x p(x) log2 p(x)

where we define 0 log 0 ≡ 0, as lim_{p→0} p log p = 0.
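A small Python sketch of this definition (not from the lecture; the helper name entropy_bits is my own), using the 0 log 0 ≡ 0 convention:

```python
import math
from typing import Sequence

def entropy_bits(probs: Sequence[float]) -> float:
    """Entropy H(X) = -sum_x p(x) log2 p(x) in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))        # 1.0 bit (fair coin)
print(entropy_bits([0.9999, 0.0001]))  # ~0.0015 bits (very predictable coin)
```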
Entropy of a Random Variable
Some Basic Properties
Non-negativity: since 0 ≤ p(x) ≤ 1,

    log(1 / p(x)) ≥ 0  ⇒  ∑_x p(x) log(1 / p(x)) ≥ 0  ⇒  H(X) ≥ 0

Change of base:

    Hb(X) = −∑_x p(x) logb p(x)
          = −∑_x p(x) loga p(x) · logb a
          = (logb a) · Ha(X)

  - If we use log2 the units are called bits
  - If we use the natural logarithm the units are called nats
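A brief numerical check of the change-of-base relation (a sketch, not lecture code; the helper entropy and the example distribution are my own):

```python
import math

def entropy(probs, base):
    """Entropy of a distribution in the given logarithm base (0 log 0 treated as 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]
h_bits = entropy(p, 2)        # 1.75 bits
h_nats = entropy(p, math.e)   # ~1.213 nats
# H_2(X) = log2(e) * H_e(X): the units differ only by the constant factor logb(a)
print(h_bits, h_nats * math.log2(math.e))  # both ~1.75
```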
Unrolling the Definition
The entropy of X is

    H(X) = −∑_x p(x) log2 p(x).

Pick a random outcome x, and see how large its probability is
Average information content of each outcome
Does not depend on the values of the outcomes
Only on their probabilities
Contrast with the expectation E[X] = ∑_x x · p(X = x).
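To illustrate that entropy ignores the outcome values while expectation does not, here is a tiny sketch (my own example, not from the slides):

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expectation(values, probs):
    return sum(v * p for v, p in zip(values, probs))

probs = [0.5, 0.5]
print(entropy_bits(probs))            # 1.0 bit, whatever the outcome values are
print(expectation([0, 1], probs))     # 0.5
print(expectation([0, 1000], probs))  # 500.0 -- expectation changes, entropy does not
```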
What Does Entropy “Mean”?
Not a well-posed question.
Entropy does match some intuitive properties of our informal notion of
“information content”
Rare outcomes provide more information
But other functions also seem plausible, e.g.

    G(X) = ∑_x p(x) · (1 / p(x)²) = ∑_x 1 / p(x).

We will see some examples where our definition of entropy arises naturally
The main justification is the results we can obtain with it.
Entropy of a Random Variable
Example 1 — Bernoulli Distribution
Let X ∈ {0, 1} with X ∼ Bern(X |θ)
Then,
p(X = 0) = 1− θ
p(X = 1) = θ
So, the entropy of a Bernoulli random variable is

    H(X) = −∑_{x∈{0,1}} p(x) · log p(x)
         = −θ log θ − (1 − θ) log(1 − θ)
Entropy of a Random Variable
Example 1 — Bernoulli Distribution
Let X ∈ {0, 1} with X ∼ Bern(X|θ) and θ = p(X = 1)

    H(X) = −θ log θ − (1 − θ) log(1 − θ)

[Plot: binary entropy H2(θ) against θ = p(X = 1); the curve rises from 0 at θ = 0 to 1 bit at θ = 0.5 and falls back to 0 at θ = 1]

Concave function of the distribution
Minimum entropy → no uncertainty about X, i.e. θ = 1 or θ = 0
Maximum entropy → complete uncertainty about X, i.e. θ = 0.5
For θ = 0.5 (e.g. a fair coin) H2(X) = 1 bit.
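A short sketch of the binary entropy curve (my own code; the slide's plot shows the same quantity):

```python
import math

def binary_entropy(theta: float) -> float:
    """H2(theta) = -theta*log2(theta) - (1-theta)*log2(1-theta), with 0 log 0 = 0."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(theta, round(binary_entropy(theta), 3))
# 0.0 and 1.0 give 0 bits (no uncertainty); 0.5 gives the maximum of 1 bit
```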
Entropy of a Random Variable
Example 2
Consider a random variable X with uniform distribution over 32 outcomes:
The entropy of this rv is given by:
    H(X) = −∑_{i=1}^{32} p(i) log2 p(i)
         = −∑_{i=1}^{32} (1/32) log2(1/32)
         = log2 32 = 5 bits.
Entropy of a Random Variable
Example 3 — Categorical Distribution
Categorical distributions with 30 different states (figure from Bishop, PRML, 2006):

[Figure: two histograms over 30 states, one sharply peaked and one broadly spread]

The more sharply peaked, the lower the entropy
The more evenly spread, the higher the entropy
Maximum for the uniform distribution: H(X) = −log(1/30) = log 30 ≈ 3.40 nats
  - When will the entropy be minimum?
Entropy of a Random Variable
Maximum Entropy
Consider a discrete variable X taking on values from the set X
Let pi be the probability of each state, with i = 1, . . . , |X |
Denote the vector of probabilities with p
The entropy is maximized if p is uniform:

    H(X) ≤ log2 |X|

with equality iff pi = 1/|X| for all i

Note log2 |X| is the number of bits needed to describe an outcome of X
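A quick sanity check of this bound (a sketch with made-up distributions, not from the slides):

```python
import math
import random

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

k = 30
uniform = [1.0 / k] * k
print(entropy_bits(uniform), math.log2(k))  # both ~4.907: the bound is attained

# Any non-uniform distribution over the same 30 states stays below log2(30)
weights = [random.random() for _ in range(k)]
p = [w / sum(weights) for w in weights]
print(entropy_bits(p) <= math.log2(k))      # True
```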
Entropy of a Random Variable
Example 4 (from Cover & Thomas, 2006) — 1 of 3
Consider a horse race with 8 horses participating:

    {acer, babe, cactus, daisy, epic, fancy, gem, hairy}

Each horse is equally likely to win. How many bits will we need to
transmit the identity of the winning horse? 000, 001, 010, ..., 111
Note that the entropy of the corresponding random variable, say X, is:

    H(X) = 8 × (1/8) × log2 8 = 3 bits.

Now say that the probabilities of each horse winning are:

    (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

What is the average code-length to transmit the identity of the winning
horse?
Entropy of a Random Variable
Example 4 (from Cover & Thomas, 2006) — 2 of 3
We see that some horses have higher probability of winning:
We can still use a 3-bit representation
  - However, this would be wasteful as some horses are more likely to win
Idea: Use shorter codes for the most probable horses and longer codes
for the less probable horses.
Let us try representing the horses (states) using the following codes:
{0, 1, 10, 11, 100, 101, 110, 111, 1000}?
Decode 010 into ’aba’ or ’ac’? Ambiguous.
We should be able to disambiguate a concatenation of these strings
into the corresponding components.
Represent the horses (states) using the following codes:
{0, 10, 110, 1110, 111100, 111101, 111110, 111111}
  - E.g. 11001110 → ??
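Since no codeword in the second set is a prefix of another, a concatenation can be decoded greedily from left to right. A small sketch of such a decoder (my own code; the mapping of codewords to the horses' initials is my own labelling, following the order on the slide):

```python
# Prefix-free code from the slide, mapped to the horses' initials (assumed ordering)
code = {
    "0": "a", "10": "b", "110": "c", "1110": "d",
    "111100": "e", "111101": "f", "111110": "g", "111111": "h",
}

def decode(bits: str) -> str:
    """Greedily match codewords left to right; valid because no codeword is a prefix of another."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in code:
            out.append(code[buf])
            buf = ""
    return "".join(out)

print(decode("11001110"))  # the concatenation splits unambiguously into codewords
```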
Entropy of a Random Variable
Example 4 (from Cover & Thomas, 2006) — 3 of 3
What is the average code length that has to be transmitted?

    Average code-length = (1/2)×1 + (1/4)×2 + (1/8)×3 + (1/16)×4 + 4×(1/64)×6 = 2 bits

What is the entropy of the corresponding random variable?

    H(X) = −( (1/2) log2(1/2) + (1/4) log2(1/4) + (1/8) log2(1/8)
              + (1/16) log2(1/16) + (4/64) log2(1/64) )
         = 2 bits
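A small sketch reproducing these two numbers (my own code, using the lengths of the codewords listed above):

```python
import math

# Winning probabilities and the codeword lengths of
# {0, 10, 110, 1110, 111100, 111101, 111110, 111111}
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
lengths = [1, 2, 3, 4, 6, 6, 6, 6]

avg_length = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * math.log2(p) for p in probs)

print(avg_length, entropy)  # 2.0 2.0 -- this code meets the entropy lower bound exactly
```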
Entropy of a Random Variable:
Example 5 (from Cover & Thomas, 2006)
Let X ∈ {1, 2, 3} and p(X = 1) = p(X = 2) = p(X = 3) = 1/3
Given the corresponding codewords 1 → 0, 2 → 10, 3 → 11:
Then H(X) = log2 3 ≈ 1.58 bits, and the average code length is 5/3 ≈ 1.67 bits
In general, entropy is a lower bound on the average number of bits to
transmit the state of a random variable.
As we shall see later, we can construct descriptors with average length
within 1 bit of the entropy.
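A quick check of these two quantities (sketch, my own code):

```python
import math

probs = [1/3, 1/3, 1/3]
lengths = [1, 2, 2]  # codewords 0, 10, 11

entropy = -sum(p * math.log2(p) for p in probs)
avg_length = sum(p * l for p, l in zip(probs, lengths))

print(round(entropy, 2), round(avg_length, 2))  # 1.58 1.67 -- the code sits ~0.08 bits above the bound
```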
Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006)
Assume that only the following horses participated in the last race: {acer,
babe, cactus, daisy}.
The corresponding probabilities of winning are given by:

    p(X = a) = 1/2,  p(X = b) = 1/4,  p(X = c) = 1/8,  p(X = d) = 1/8.

You want to determine which horse won the race with the minimum
number of yes/no questions:
(a) What binary questions should you ask?
(b) What is the minimum expected number of binary questions for this?
Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006) — Cont’d
As acer is the most likely to have won the race, I first ask about him: has
X = a won the race?
If the answer is no, I then ask about the second most probable winner: has
X = b won the race?
Then X = c?, and X = d?
Note that the series of questions corresponding to an outcome can be
seen as a code!
Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006) — Cont’d
[Decision tree: ask X = a?; if no, ask X = b?; if no, ask X = c?; record 1 for yes and 0 for no]

The answers spell out a codeword for each outcome:

    a → 1
    b → 01
    c → 001
    d → 000
Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006) — Cont’d
The entropy of this random variable determines a lower bound for the
minimum number of binary questions:
    H2(X) = −( (1/2) log2(1/2) + (1/4) log2(1/4) + (1/8) log2(1/8) + (1/8) log2(1/8) )
          = 1.75 bits.
This is in fact the minimum expected number of binary questions. In
general, this number lies between H(X ) and H(X ) + 1
Intuitively, each question reduces our amount of uncertainty in the
outcome by attempting to eliminate (or validate) the hard to predict
outcomes
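A sketch simulating this questioning strategy (my own code) to check that the expected number of questions matches the 1.75-bit entropy:

```python
import math
import random

probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
questions = {"a": 1, "b": 2, "c": 3, "d": 3}  # d is inferred once a, b and c are ruled out

expected = sum(p * questions[h] for h, p in probs.items())
entropy = -sum(p * math.log2(p) for p in probs.values())
print(expected, entropy)  # 1.75 1.75

# Monte Carlo check of the same expectation
outcomes = random.choices(list(probs), weights=list(probs.values()), k=100_000)
print(sum(questions[h] for h in outcomes) / len(outcomes))  # ~1.75
```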
Joint Entropy
The joint entropy H(X, Y) of a pair of discrete random variables with joint
distribution p(X, Y) is given by:

    H(X, Y) = E_{X,Y}[ log(1 / p(X, Y)) ]
            = ∑_{x∈X} ∑_{y∈Y} p(x, y) log(1 / p(x, y))
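A minimal sketch computing the joint entropy from a joint probability table (my own example values, not from the lecture):

```python
import math

# Hypothetical joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

H_XY = sum(p * math.log2(1 / p) for p in joint.values() if p > 0)
print(round(H_XY, 3))  # ~1.846 bits
```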
Joint Entropy:
Independent Random Variables
If X and Y are statistically independent we have that:
    H(X, Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log(1 / p(x, y))
            = −∑_{x∈X} ∑_{y∈Y} p(x) p(y) [ log p(x) + log p(y) ]      (as p(x, y) = p(x) p(y))
            = −∑_{x∈X} p(x) log p(x) ∑_{y∈Y} p(y) − ∑_{y∈Y} p(y) log p(y) ∑_{x∈X} p(x)
                                                    (each of the inner sums equals 1)
            = ∑_{x∈X} p(x) log(1 / p(x)) + ∑_{y∈Y} p(y) log(1 / p(y))
            = H(X) + H(Y)

Entropy is additive for independent random variables
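A quick numerical check of this additivity (sketch; the marginals below are made up):

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = [0.7, 0.3]
py = [0.2, 0.5, 0.3]

# Joint distribution of independent X and Y: p(x, y) = p(x) * p(y)
joint = [p * q for p in px for q in py]

print(round(entropy_bits(joint), 4))                  # ~2.3668
print(round(entropy_bits(px) + entropy_bits(py), 4))  # same value: H(X) + H(Y)
```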
Conditional Entropy
The conditional entropy of Y given X = x is the entropy of the probability
distribution p(Y |X = x):
    H(Y|X = x) = ∑_{y∈Y} p(y|X = x) log(1 / p(y|X = x))

The conditional entropy of Y given X is the average over X of the
conditional entropy of Y given X = x:

    H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
           = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log(1 / p(y|x))

Average uncertainty that remains about Y when X is known.
Conditional Entropy — Cont’d
We can re-write the conditional entropy as follows:
    H(Y|X) = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log(1 / p(y|x))
           = ∑_{x∈X} ∑_{y∈Y} p(x) p(y|x) log(1 / p(y|x))
           = ∑_{x∈X} ∑_{y∈Y} p(x, y) log(1 / p(y|x))
           = E_{X,Y}[ log(1 / p(Y|X)) ]

Note the expectation is not wrt the conditional distribution but wrt the joint
distribution p(X, Y)
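A small sketch computing H(Y|X) both ways from a joint table (my own example numbers), confirming the two forms agree:

```python
import math

# Hypothetical joint distribution p(x, y)
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}

# Definition: average over x of the entropy of p(Y | X = x)
H_cond_def = sum(
    px[x] * sum((p / px[x]) * math.log2(px[x] / p) for (xx, _), p in joint.items() if xx == x)
    for x in (0, 1)
)

# Joint-expectation form: E_{X,Y}[ log 1/p(Y|X) ]
H_cond_joint = sum(p * math.log2(px[x] / p) for (x, _), p in joint.items())

print(round(H_cond_def, 4), round(H_cond_joint, 4))  # identical values (~0.846 bits)
```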
Chain Rule
The joint entropy can be written as:
    H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
            = −∑_{x∈X} ∑_{y∈Y} p(x, y) [ log p(x) + log p(y|x) ]
            = −∑_{x∈X} log p(x) ∑_{y∈Y} p(x, y) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
                                (the inner sum ∑_{y∈Y} p(x, y) equals p(x))

    H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

The joint uncertainty of X and Y is the uncertainty of X plus the
uncertainty of Y given X
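A numerical check of the chain rule on the same kind of joint table (sketch, my own numbers):

```python
import math

joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}
px = {0: 0.5, 1: 0.5}  # marginals of X for this table

H_XY = sum(p * math.log2(1 / p) for p in joint.values())
H_X = sum(p * math.log2(1 / p) for p in px.values())
H_Y_given_X = sum(p * math.log2(px[x] / p) for (x, _), p in joint.items())

print(round(H_XY, 4), round(H_X + H_Y_given_X, 4))  # both ~1.8464: H(X,Y) = H(X) + H(Y|X)
```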
An Axiomatic Characterisation
Suppose we want a measure H of “information” in a random variable X
such that
1 H depends on the distribution of X , and not the outcomes themselves
2 The H for the combination of two variables X ,Y is at most the sum of
the corresponding H values
3 The H for the combination of two independent variables X ,Y is the
sum of the corresponding H values
4 Adding outcomes with probability zero does not affect H
5 The H for an unbiased Bernoulli is 1
6 The H for a Bernoulli with parameter p tends to 0 as p → 0
Then, the only possible choice for H is
    H(X) = −∑_x p(x) log2 p(x)
Summary
Entropy as a measure of information content
Computation of entropy of discrete random variables
Entropy and average code length
Entropy and minimum expected number of binary questions
Joint and conditional entropies, chain rule
Reading: MacKay § 1.2 – § 1.5, § 8.1; Cover & Thomas § 2.1;
Bishop § 1.6
Next time
More properties of entropy
Relative entropy
Mutual information