
COMP2610 / COMP6261 – Information Theory
Lecture 6: Entropy

Robert Williamson

Research School of Computer Science

7 August 2018

1 / 39

Last time

The Bernoulli and Binomial distributions

Maximum likelihood estimation

Bayesian parameter estimation

2 / 39

This time

Information content and entropy

Examples and intuition

Some basic properties of entropy

3 / 39

Outline

1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties

2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy

3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions

4 Joint Entropy, Conditional Entropy and Chain Rule

5 An Axiomatic Characterisation

6 Wrapping up

4 / 39

Recap: A General Communication System

How informative is a message?

5 / 39

Information Content: Informally

Say that a message comprises a single bit (one binary random variable)

Whether or not a coin comes up heads

Whether or not my favourite horse wins a race

Informally, the amount of information in such a message is:

How unexpected or “surprising” an outcome of the random variable is
- If a coin comes up Heads 99.99% of the time, the message “Tails” is
  much more informative than “Heads”
- If I believe my favourite horse will win with 99.99% probability, it is
  surprising to find out it did not

How predictable the random variable is
- If a coin comes up Heads 99.99% of the time, we can predict the next
  message as “Heads” and be right most of the time
- If I believe my favourite horse will win with 99.99% probability, then I
  expect the prediction “win” to be right most of the time

6 / 39


Information Content: Formally

Intuitively, we measure information of a message in relation to the other
messages we could have seen

For binary messages, we either see 0 or 1

The message 1 is informative when there is a good chance I might
have seen 0

How can we formalise and thus measure information content?

Information content of an outcome must depend on its probability

Information content of a random variable must depend on its
probability distribution

7 / 39

Information Content of an Outcome: Definition

Let X be a discrete r.v. with possible outcomes X
e.g. X = {0, 1}

e.g. X = {Yes, No, Maybe}

Let p(x) denote the probability of outcome x ∈ X

The information content of an outcome x ∈ X is:

h(x) = log2 (1 / p(x))

8 / 39

Information Content of an Outcome: Properties

The information content of x grows as p(x) shrinks

Outcomes that are rare are deemed to contain more information

Choice of logarithm basis is arbitrary

If we use log2 we measure information in bits

What about other functions of p(x), e.g. 1/p(x)^2 − 1?

9 / 39

Entropy of a Random Variable: Definition

Let X be a discrete r.v. with possible outcomes X .

The entropy of the random variable X is the average information content of
the outcomes:

H(X ) = E_x [h(x)]
      = Σ_x p(x) · log2 (1 / p(x))
      = − Σ_x p(x) log2 p(x)

where we define 0 log 0 ≡ 0, as lim_{p→0} p log p = 0.

10 / 39
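To make the definitions above concrete, here is a minimal Python sketch (mine, not part of the lecture) that computes the information content h(x) and the entropy H(X) of a finite distribution given as a list of probabilities; the function names are illustrative only.

```python
import math

def info_content(p, base=2):
    """Information content h(x) = log(1/p) of an outcome with probability p."""
    return math.log(1.0 / p, base)

def entropy(probs, base=2):
    """Entropy H(X) = sum_x p(x) log(1/p(x)), with the convention 0 log 0 = 0."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(info_content(0.5))           # 1.0 bit: outcome of a fair coin flip
print(entropy([0.5, 0.5]))         # 1.0 bit
print(entropy([0.9999, 0.0001]))   # close to 0 bits: a very predictable variable
```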

Entropy of a Random Variable
Some Basic Properties

Non-negativity:

0 ≤ p(x) ≤ 1 ⇒ log (1 / p(x)) ≥ 0
             ⇒ Σ_x p(x) log (1 / p(x)) ≥ 0
             ⇒ H(X ) ≥ 0

Change of base:

Hb(X ) = − Σ_x p(x) logb p(x)
       = − Σ_x p(x) loga p(x) · logb a
       = logb a · Ha(X )

- If we use log2 the units are called bits
- If we use the natural logarithm the units are called nats

11 / 39
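A quick numerical illustration of the change-of-base property (my own check, not from the slides): the entropy measured in bits equals the entropy measured in nats divided by ln 2.

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]

h_bits = sum(p * math.log2(1 / p) for p in probs)   # entropy in bits
h_nats = sum(p * math.log(1 / p) for p in probs)    # entropy in nats

# Change of base: H_2(X) = H_e(X) / ln 2, a special case of H_b(X) = log_b(a) H_a(X)
assert abs(h_bits - h_nats / math.log(2)) < 1e-12
print(h_bits, h_nats)   # 1.75 bits and about 1.213 nats
```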


Unrolling the Definition

The entropy of X is

H(X ) = − Σ_x p(x) log2 p(x).

Pick a random outcome x , and see how large its probability is
Average information content of each outcome

Does not depend on the values of the outcomes
Only on their probabilities

Contrast with expectation E[X ] = Σ_x x · p(X = x).

12 / 39

What Does Entropy “Mean”?

Not a well posed question.
Entropy does match some intuitive properties of our informal notion of
“information content”

Rare outcomes provide more information

But other functions also seem plausible, e.g.

G(X ) = Σ_x p(x) · (1 / p(x)^2) = Σ_x 1 / p(x).

We will see some examples where our definition of entropy arises naturally
The main justification is the results we can obtain with it.

13 / 39

1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties

2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy

3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions

4 Joint Entropy, Conditional Entropy and Chain Rule

5 An Axiomatic Characterisation

6 Wrapping up

14 / 39

Entropy of a Random Variable
Example 1 — Bernoulli Distribution

Let X ∈ {0, 1} with X ∼ Bern(X |θ)

Then,

p(X = 0) = 1− θ
p(X = 1) = θ

So, the entropy of a Bernoulli random variable is

H(X ) = − Σ_{x∈{0,1}} p(x) · log p(x)
      = −θ log θ − (1− θ) log(1− θ)

15 / 39

Entropy of a Random Variable
Example 1 — Bernoulli Distribution

Let X ∈ {0, 1} with X ∼ Bern(X |θ) and θ = p(X = 1)

H(X ) = −θ log θ − (1− θ) log(1− θ)

[Figure: plot of the binary entropy H2(X ) against θ = p(X = 1), rising from 0 at θ = 0 to 1 bit at θ = 0.5 and back down to 0 at θ = 1]

Concave function of the distribution
Minimum entropy → no uncertainty about X , i.e. θ = 1 or θ = 0
Maximum entropy → complete uncertainty about X , i.e. θ = 0.5
For θ = 0.5 (e.g. a fair coin), H2(X ) = 1 bit.

16 / 39
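A small Python sketch (mine, not from the slides) of the binary entropy function; evaluating it over a grid of θ values reproduces the concave curve described above.

```python
import math

def binary_entropy(theta):
    """H2(theta) = -theta log2(theta) - (1 - theta) log2(1 - theta), in bits."""
    if theta in (0.0, 1.0):   # convention: 0 log 0 = 0
        return 0.0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

for theta in (0.0, 0.1, 0.3, 0.5, 0.9, 1.0):
    print(f"theta = {theta:.1f}   H2 = {binary_entropy(theta):.3f} bits")
# Zero at theta = 0 or 1 (no uncertainty); maximum of 1 bit at theta = 0.5.
```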


Entropy of a Random Variable
Example 2

Consider a random variable X with uniform distribution over 32 outcomes:

The entropy of this r.v. is given by:

H(X ) = − Σ_{i=1}^{32} p(i) log2 p(i) = − Σ_{i=1}^{32} (1/32) log2 (1/32) = log2 32 = 5 bits.

17 / 39

Entropy of a Random Variable
Example 3 — Categorical Distribution

Categorical distributions with 30 different states:

[Figure: histograms of categorical distributions over the 30 states (from Bishop, PRML, 2006)]

The more sharply peaked, the lower the entropy
The more evenly spread, the higher the entropy
Maximum for the uniform distribution: H(X ) = − log (1/30) = log 30 ≈ 3.40 nats
- When will the entropy be minimum?

18 / 39


Entropy of a Random Variable
Maximum Entropy

Consider a discrete variable X taking on values from the set X
Let pi be the probability of each state, with i = 1, . . . , |X |

Denote the vector of probabilities with p

The entropy is maximized if p is uniform:

H(X ) ≤ log2 |X |,  with equality iff pi = 1 / |X | for all i

Note log2 |X | is the number of bits needed to describe an outcome of X

19 / 39
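A quick numerical check (my own, not from the slides) that, among distributions on the same outcome set, the uniform one attains the largest entropy, log2 |X|.

```python
import math

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

n = 4
candidates = {
    "uniform":    [1 / n] * n,
    "peaked":     [0.7, 0.1, 0.1, 0.1],
    "degenerate": [1.0, 0.0, 0.0, 0.0],
}
for name, p in candidates.items():
    print(f"{name:>10}: H = {entropy(p):.3f} bits   (log2|X| = {math.log2(n):.3f})")
# uniform: 2.000, peaked: about 1.357, degenerate: 0.000
```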


1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties

2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy

3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions

4 Joint Entropy, Conditional Entropy and Chain Rule

5 An Axiomatic Characterisation

6 Wrapping up

20 / 39

Entropy of a Random Variable
Example 4 (from Cover & Thomas, 2006) — 1 of 3

Consider a horse race with 8 horses participating:

{acer, babe, cactus, daisy, epic, fancy, gem, hairy}

Each horse is equally likely to win. How many bits will we need to
transmit the identity of the winning horse? 000, 001, 010, . . ., 111

Note that the entropy of the corresponding random variable, say X , is:

H(X ) = 8 × (1/8) log2 8 = 3 bits.

Now say that the probabilities of each horse winning are:

(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

What is the average code-length to transmit the identity of the winning
horse?

21 / 39


Entropy of a Random Variable
Example 4 (from Cover & Thomas, 2006) — 2 of 3

We see that some horses have a higher probability of winning:

We can still use a 3-bit representation
- However, this would be wasteful as some horses are more likely to win

Idea: Use shorter codes for the most probable horses and longer codes
for the less probable horses.

Let us try representing the horses (states) using the following codes:

{0, 1, 10, 11, 100, 101, 110, 111}

Decode 010 into ’aba’ or ’ac’? Ambiguous.
We should be able to disambiguate a concatenation of these strings
into the corresponding components.

Represent the horses (states) using the following codes instead:

{0, 10, 110, 1110, 111100, 111101, 111110, 111111}

- E.g. 11001110 → ??

22 / 39
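A minimal Python sketch (not from the slides) of decoding with the second set of codewords: because no codeword is a prefix of another, a greedy left-to-right scan splits any concatenation unambiguously. The codeword-to-horse mapping below is the natural one (shortest code for the most probable horse) and is an assumption.

```python
# Prefix-free code: codeword -> horse (shortest codeword assumed for the most probable horse)
code = {
    "0": "acer", "10": "babe", "110": "cactus", "1110": "daisy",
    "111100": "epic", "111101": "fancy", "111110": "gem", "111111": "hairy",
}

def decode(bits):
    """Greedily match codewords left to right; valid because the code is prefix-free."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in code:
            out.append(code[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete codeword")
    return out

print(decode("11001110"))   # ['cactus', 'acer', 'daisy']
```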


Entropy of a Random Variable
Example 4 (from Cover & Thomas, 2006) — 3 of 3

What is the average code length that has to be transmitted?

Average code-length = (1/2)×1 + (1/4)×2 + (1/8)×3 + (1/16)×4 + 4×(1/64)×6 = 2 bits

What is the entropy of the corresponding random variable?

H(X ) = −( (1/2) log2 (1/2) + (1/4) log2 (1/4) + (1/8) log2 (1/8)
          + (1/16) log2 (1/16) + (4/64) log2 (1/64) ) = 2 bits

23 / 39
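A short Python check (my own) that the expected codeword length of the prefix code above and the entropy of the winning-horse distribution are both 2 bits; for this code every codeword length equals log2(1/p(x)).

```python
import math

probs   = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # lengths of 0, 10, 110, 1110, 111100, ...

avg_len = sum(p * l for p, l in zip(probs, lengths))
entropy = sum(p * math.log2(1 / p) for p in probs)

print(avg_len, entropy)   # both 2.0 bits
```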


Entropy of a Random Variable:
Example 5 (from Cover & Thomas, 2006)

Let X ∈ {1, 2, 3} and p(X = 1) = p(X = 2) = p(X = 3) = 1/3

Given the corresponding codewords:

1 → 0,   2 → 10,   3 → 11

Then H(X ) ≈ 1.58 bits, and the average code length ≈ 1.66 bits

In general, entropy is a lower bound on the average number of bits to
transmit the state of a random variable.

As we shall see later, we can construct codes with average length
within 1 bit of the entropy.

24 / 39


Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006)

Assume that only the following horses participated in the last race: {acer,
babe, cactus, daisy}.

The corresponding probabilities of winning are given by:

p(X = a) = 1/2,  p(X = b) = 1/4,  p(X = c) = 1/8,  p(X = d) = 1/8.

You want to determine which horse won the race with the minimum
number of yes/no questions:

(a) What binary questions should you ask?

(b) What is the minimum expected number of binary questions for this?

25 / 39

Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006) — Cont’d

As acer is more likely to have won the race I first ask about him: has
X = a won the race?

If the answer is no, I then ask about the second most probable winner: has
X = b won the race?

Then X = c?, and X = d?

Note that the series of questions corresponding to an outcome can be
seen as a code!

26 / 39

Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006) — Cont’d

[Figure: decision tree asking “X = a?”, then “X = b?”, then “X = c?”, where each “yes” branch terminates]

The corresponding codewords (1 for yes, 0 for no):

a → 1
b → 01
c → 001
d → 000
27 / 39

Entropy of a Random Variable:
What Questions Should We Ask? (From Cover & Thomas, 2006) — Cont’d

The entropy of this random variable determines a lower bound for the
minimum number of binary questions:

H2(X ) = −( (1/2) log2 (1/2) + (1/4) log2 (1/4) + (1/8) log2 (1/8) + (1/8) log2 (1/8) ) = 1.75 bits.

This is in fact the minimum expected number of binary questions. In
general, this number lies between H(X ) and H(X ) + 1

Intuitively, each question reduces our uncertainty about the
outcome by attempting to eliminate (or validate) the hard-to-predict
outcomes

28 / 39
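A small Python check (mine, not from the slides) that the questioning strategy above asks 1.75 questions on average, matching the entropy:

```python
import math

probs     = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
questions = {"a": 1, "b": 2, "c": 3, "d": 3}   # questions asked before each horse is identified

expected_questions = sum(p * questions[h] for h, p in probs.items())
entropy = sum(p * math.log2(1 / p) for p in probs.values())

print(expected_questions, entropy)   # 1.75 and 1.75 bits
```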

1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties

2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy

3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions

4 Joint Entropy, Conditional Entropy and Chain Rule

5 An Axiomatic Characterisation

6 Wrapping up

29 / 39

Joint Entropy

The joint entropy H(X ,Y ) of a pair of discrete random variables with joint
distribution p(X ,Y ) is given by:

H(X ,Y ) = E_{X ,Y} [ log (1 / p(X ,Y )) ] = Σ_{x∈X} Σ_{y∈Y} p(x , y) log (1 / p(x , y))

30 / 39
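A minimal Python sketch (not from the slides) computing the joint entropy from a joint probability table stored as a dictionary; the table used here is only an example.

```python
import math

def joint_entropy(pxy):
    """H(X, Y) = sum_{x,y} p(x, y) log2(1 / p(x, y)) for a dict {(x, y): prob}."""
    return sum(p * math.log2(1 / p) for p in pxy.values() if p > 0)

# Example joint distribution over X in {0, 1} and Y in {0, 1}
pxy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}
print(joint_entropy(pxy))   # 1.75 bits
```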

Joint Entropy:
Independent Random Variables

If X and Y are statistically independent we have that:

H(X ,Y ) = Σ_{x∈X} Σ_{y∈Y} p(x , y) log (1 / p(x , y))

         = − Σ_{x∈X} Σ_{y∈Y} p(x)p(y) [ log p(x) + log p(y) ]          (as p(x , y) = p(x)p(y))

         = − Σ_{x∈X} p(x) log p(x) Σ_{y∈Y} p(y)  −  Σ_{y∈Y} p(y) log p(y) Σ_{x∈X} p(x)
           (each of the inner sums Σ_{y∈Y} p(y) and Σ_{x∈X} p(x) equals 1)

         = Σ_{x∈X} p(x) log (1 / p(x)) + Σ_{y∈Y} p(y) log (1 / p(y))

         = H(X ) + H(Y )

Entropy is additive for independent random variables

31 / 39
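A numerical check of additivity (my own, not from the slides): build a product joint distribution from two marginals and compare H(X ,Y ) with H(X ) + H(Y ).

```python
import math

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

px = [0.7, 0.3]            # marginal distribution of X
py = [0.5, 0.25, 0.25]     # marginal distribution of Y

# Independence: p(x, y) = p(x) p(y)
pxy = [p * q for p in px for q in py]

print(entropy(pxy), entropy(px) + entropy(py))   # equal: H(X,Y) = H(X) + H(Y)
```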


Conditional Entropy

The conditional entropy of Y given X = x is the entropy of the probability
distribution p(Y |X = x):

H(Y |X = x) = Σ_{y∈Y} p(y |X = x) log (1 / p(y |X = x))

The conditional entropy of Y given X is the average over X of the
conditional entropy of Y given X = x :

H(Y |X ) = Σ_{x∈X} p(x) H(Y |X = x)
         = Σ_{x∈X} p(x) Σ_{y∈Y} p(y |x) log (1 / p(y |x))

Average uncertainty that remains about Y when X is known.

32 / 39
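A Python sketch (mine, not from the slides) computing H(Y |X ) from a joint table by weighting each term with p(x , y) = p(x)p(y |x), mirroring the definition above:

```python
import math
from collections import defaultdict

def conditional_entropy(pxy):
    """H(Y|X) = sum_{x,y} p(x, y) log2(1 / p(y|x)) for a dict {(x, y): prob}."""
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p                        # marginal p(x)
    h = 0.0
    for (x, _), p in pxy.items():
        if p > 0:
            p_y_given_x = p / px[x]       # p(y|x) = p(x, y) / p(x)
            h += p * math.log2(1 / p_y_given_x)
    return h

pxy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}
print(conditional_entropy(pxy))   # average remaining uncertainty about Y once X is known
```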


Conditional Entropy — Cont’d

We can re-write the conditional entropy as follows:

H(Y |X ) = Σ_{x∈X} p(x) Σ_{y∈Y} p(y |x) log (1 / p(y |x))

         = Σ_{x∈X} Σ_{y∈Y} p(x)p(y |x) log (1 / p(y |x))

         = Σ_{x∈X} Σ_{y∈Y} p(x , y) log (1 / p(y |x))

         = E_{X ,Y} [ log (1 / p(Y |X )) ]

Note the expectation is not w.r.t. the conditional distribution but w.r.t. the joint
distribution p(X ,Y )

33 / 39


Chain Rule

The joint entropy can be written as:

H(X ,Y ) = − Σ_{x∈X} Σ_{y∈Y} p(x , y) log p(x , y)

         = − Σ_{x∈X} Σ_{y∈Y} p(x , y) [ log p(x) + log p(y |x) ]

         = − Σ_{x∈X} log p(x) Σ_{y∈Y} p(x , y)  −  Σ_{x∈X} Σ_{y∈Y} p(x , y) log p(y |x)
           (the inner sum Σ_{y∈Y} p(x , y) equals p(x))

H(X ,Y ) = H(X ) + H(Y |X ) = H(Y ) + H(X |Y )

The joint uncertainty of X and Y is the uncertainty of X plus the
uncertainty of Y given X

34 / 39
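A numerical check of the chain rule (my own, not from the slides) on a small joint table: H(X ,Y ) and H(X ) + H(Y |X ) agree.

```python
import math
from collections import defaultdict

pxy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

px = defaultdict(float)
for (x, _), p in pxy.items():
    px[x] += p                                                     # marginal p(x)

h_xy = sum(p * math.log2(1 / p) for p in pxy.values() if p > 0)    # H(X, Y)
h_x  = sum(p * math.log2(1 / p) for p in px.values() if p > 0)     # H(X)
h_y_given_x = sum(p * math.log2(px[x] / p)                         # H(Y|X), since p(y|x) = p / px[x]
                  for (x, _), p in pxy.items() if p > 0)

print(h_xy, h_x + h_y_given_x)   # chain rule: both 1.75 bits for this table
```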


1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties

2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy

3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions

4 Joint Entropy, Conditional Entropy and Chain Rule

5 An Axiomatic Characterisation

6 Wrapping up

35 / 39

An Axiomatic Characterisation

Suppose we want a measure H of “information” in a random variable X
such that

1 H depends on the distribution of X , and not the outcomes themselves
2 The H for the combination of two variables X ,Y is at most the sum of
  the corresponding H values
3 The H for the combination of two independent variables X ,Y is the
  sum of the corresponding H values
4 Adding outcomes with probability zero does not affect H
5 The H for an unbiased Bernoulli is 1
6 The H for a Bernoulli with parameter p tends to 0 as p → 0

Then, the only possible choice for H is

H(X ) = − Σ_x p(x) log2 p(x)

36 / 39

Outline

1 Information Content & Entropy
Entropy of a Random Variable
Some Basic Properties

2 Examples: Bernoulli and Categorical Random Variables
Maximum Entropy

3 Entropy as Code Length
Average Code Length
Minimum Number of Binary Questions

4 Joint Entropy, Conditional Entropy and Chain Rule

5 An Axiomatic Characterisation

6 Wrapping up

37 / 39

Summary

Entropy as a measure of information content

Computation of entropy of discrete random variables

Entropy and average code length

Entropy and minimum expected number of binary questions

Joint and conditional entropies, chain rule

Reading: MacKay § 1.2 – § 1.5, § 8.1; Cover & Thomas § 2.1;
Bishop § 1.6

38 / 39

Next time

More properties of entropy

Relative entropy

Mutual information

39 / 39
