
COMP2610 / COMP6261 – Information Theory
Lecture 7: Relative Entropy and Mutual Information

Robert C. Williamson

Research School of Computer Science


13 August 2018

1 / 31

Last time

Information content and entropy: definition and computation

Entropy and average code length

Entropy and minimum expected number of binary questions

Joint and conditional entropies, chain rule

2 / 31

Information Content: Review

Let X be a random variable with outcomes in X

Let p(x) denote the probability of the outcome x ∈ X

The (Shannon) information content of outcome x is

h(x) = log_2 (1 / p(x))

As p(x) → 0, h(x) → +∞ (rare outcomes are more informative)

3 / 31

Entropy: Review

The entropy is the average information content of all outcomes:

H(X) = Σ_{x ∈ X} p(x) log_2 (1 / p(x))

Entropy is minimised if p is peaked and maximised if p is uniform:

0 ≤ H(X) ≤ log |X|

Entropy is related to the minimal number of bits needed to describe a random variable
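As a quick sanity check of this review, here is a minimal Python sketch (the `entropy` helper is my own, not from the lecture) that computes H(X) in bits from a probability vector and checks the bound 0 ≤ H(X) ≤ log |X|:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = sum_x p(x) * log2(1 / p(x)).
    Terms with p(x) = 0 contribute 0 by convention."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# A peaked distribution has low entropy; the uniform one attains log2|X|.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(peaked))    # ~0.24 bits
print(entropy(uniform))   # 2.0 bits = log2(4)
assert 0 <= entropy(peaked) <= math.log2(len(peaked))
```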

4 / 31

This time

The decomposability property of entropy

Relative entropy and divergences

Mutual information

5 / 31

Outline

1 Decomposability of Entropy

2 Relative Entropy / KL Divergence

3 Mutual Information
Definition
Joint and Conditional Mutual Information

4 Wrapping up

6 / 31

Decomposability of Entropy
Example 1 (Mackay, 2003)

Let X ∈ {1, 2, 3} be a r.v. created by the following process:
1 Flip a fair coin to determine whether X = 1

2 If X ≠ 1, flip another fair coin to determine whether X = 2 or X = 3

The probability distribution of X is given by:

p(X = 1) = 1/2
p(X = 2) = 1/4
p(X = 3) = 1/4

7 / 31

Decomposability of Entropy
Example 1 (Mackay, 2003) — Cont’d

By definition,

H(X) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits.

But imagine learning the value of X gradually:
1 First we learn whether X = 1:
  - Binary variable with probabilities (1/2, 1/2)
  - Hence H(1/2, 1/2) = log_2 2 = 1 bit.
2 If X ≠ 1, we learn the value of the second coin flip:
  - Also a binary variable with probabilities (1/2, 1/2)
  - Therefore H(1/2, 1/2) = 1 bit.

However, the second revelation only happens half of the time:

H(X) = H(1/2, 1/2) + (1/2) H(1/2, 1/2) = 1.5 bits.
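A minimal numerical check of this two-stage computation, assuming base-2 logarithms (the `entropy` helper is my own):

```python
import math

def entropy(probs):
    # H in bits; zero-probability terms contribute 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Direct computation for p = (1/2, 1/4, 1/4).
direct = entropy([0.5, 0.25, 0.25])

# Two-stage computation: "Is X = 1?" plus, half of the time, the second coin.
staged = entropy([0.5, 0.5]) + 0.5 * entropy([0.5, 0.5])

print(direct, staged)   # both 1.5 bits
assert math.isclose(direct, staged)
```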

8 / 31

Decomposability of Entropy
Generalization

For a r.v. with probability distribution p = (p_1, . . . , p_{|X|}):

H(p) = H(p_1, 1 − p_1) + (1 − p_1) H( p_2 / (1 − p_1), . . . , p_{|X|} / (1 − p_1) )

H(p_1, 1 − p_1): entropy of a random variable corresponding to “Is X = 1?”

1 − p_1: probability of X ≠ 1

p_2 / (1 − p_1), . . . , p_{|X|} / (1 − p_1): conditional probabilities of X = 2, . . . , |X| given X ≠ 1.

H( p_2 / (1 − p_1), . . . , p_{|X|} / (1 − p_1) ): entropy of a random variable corresponding to the outcomes when X ≠ 1.

9 / 31

Decomposability of Entropy
Generalization

In general, we have that for any m between 1 and |X| − 1:

H(p) = H( Σ_{i=1}^{m} p_i , Σ_{i=m+1}^{|X|} p_i )
       + ( Σ_{i=1}^{m} p_i ) · H( p_1 / Σ_{i=1}^{m} p_i , . . . , p_m / Σ_{i=1}^{m} p_i )
       + ( Σ_{i=m+1}^{|X|} p_i ) · H( p_{m+1} / Σ_{i=m+1}^{|X|} p_i , . . . , p_{|X|} / Σ_{i=m+1}^{|X|} p_i )

Apply this formula with m = 1, |X| = 3, p = (p_1, p_2, p_3) = (1/2, 1/4, 1/4)
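A sketch that verifies this grouping identity numerically for every split point m (the helper names `entropy` and `decomposed_entropy` are my own):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def decomposed_entropy(p, m):
    """Right-hand side of the grouping identity for a split after index m."""
    left, right = sum(p[:m]), sum(p[m:])
    h = entropy([left, right])
    if left > 0:
        h += left * entropy([pi / left for pi in p[:m]])
    if right > 0:
        h += right * entropy([pi / right for pi in p[m:]])
    return h

p = [0.5, 0.25, 0.25]
print(entropy(p))                 # 1.5 bits
print(decomposed_entropy(p, 1))   # 1.5 bits: the m = 1 case from the slide
assert all(math.isclose(entropy(p), decomposed_entropy(p, m)) for m in range(1, len(p)))
```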

10 / 31

1 Decomposability of Entropy

2 Relative Entropy / KL Divergence

3 Mutual Information
Definition
Joint and Conditional Mutual Information

4 Wrapping up

11 / 31

Entropy in Information Theory

If a random variable has distribution p, there exists an encoding with an
average length of

H(p) bits

and this is the “best” possible encoding

What happens if we use a “wrong” encoding?
e.g. because we make an incorrect assumption on the probability distribution

If the true distribution is p, but we assume it is q, it turns out we will need
to use

H(p) + DKL(p‖q) bits

where DKL(p‖q) is some measure of “distance” between p and q
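To make the coding claim concrete: under the idealisation that a code designed for q assigns length log_2 (1/q(x)) to symbol x (ignoring rounding to integer lengths), the expected length when symbols are actually drawn from p is Σ_x p(x) log_2 (1/q(x)) = H(p) + DKL(p‖q). A small sketch of that identity (the distributions below are arbitrary examples, not from the lecture):

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # assumed (wrong) distribution

# Expected length when code lengths are log2(1/q(x)) but symbols follow p.
expected_len = sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q))

print(expected_len)             # 2.0 bits
print(entropy(p) + kl(p, q))    # also 2.0 bits: H(p) + DKL(p||q)
assert math.isclose(expected_len, entropy(p) + kl(p, q))
```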

12 / 31

Relative Entropy

Definition
The relative entropy or Kullback-Leibler (KL) divergence between two
probability distributions p(X ) and q(X ) is defined as:

DKL(p‖q) = Σ_{x ∈ X} p(x) ( log (1 / q(x)) − log (1 / p(x)) )

         = Σ_{x ∈ X} p(x) log ( p(x) / q(x) )

         = E_{p(X)} [ log ( p(X) / q(X) ) ].

Note:
- Both p(X) and q(X) are defined over the same alphabet X

Conventions for the log-likelihood ratio:

0 log (0/0) := 0        0 log (0/q) := 0        p log (p/0) := ∞
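A direct implementation of this definition in Python, including the conventions above (a sketch; the function name `kl_divergence` is my own):

```python
import math

def kl_divergence(p, q):
    """DKL(p||q) in bits, for distributions given as aligned probability lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue           # 0 log(0/q) = 0 by convention
        if qx == 0:
            return math.inf    # p log(p/0) = infinity
        total += px * math.log2(px / qx)
    return total

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))                          # > 0
print(kl_divergence(p, p))                          # 0.0, since DKL(p||p) = 0
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: not symmetric
```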

13 / 31

Relative Entropy
Properties

DKL(p‖q) ≥ 0 (proof next lecture)

DKL(p‖q) = 0⇔ p = q (proof next lecture)

Not symmetric: DKL(p‖q) ≠ DKL(q‖p)

Does not satisfy the triangle inequality: in general, DKL(p‖q) ≤ DKL(p‖r) + DKL(r‖q) need not hold
  - Not a true distance, since it is not symmetric and does not satisfy the triangle inequality
  - Hence, “KL divergence” rather than “KL distance”

14 / 31

Relative Entropy
Uniform q

Let q correspond to a uniform distribution: q(x) = 1 / |X|

Relative entropy between p and q:

DKL(p‖q) = Σ_{x ∈ X} p(x) log ( p(x) / q(x) )

         = Σ_{x ∈ X} p(x) · ( log p(x) + log |X| )

         = −H(X) + Σ_{x ∈ X} p(x) · log |X|

         = −H(X) + log |X|.

This matches the intuition of DKL(p‖q) as the penalty, in extra bits, for encoding with the wrong distribution
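A quick numerical check of this identity (a sketch; the distribution p below is an arbitrary example):

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
uniform = [1.0 / len(p)] * len(p)

# DKL(p || uniform) should equal log2|X| - H(p).
print(kl(p, uniform))                    # 0.25 bits
print(math.log2(len(p)) - entropy(p))    # 0.25 bits
assert math.isclose(kl(p, uniform), math.log2(len(p)) - entropy(p))
```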

15 / 31

Relative Entropy
Example (from Cover & Thomas, 2006)

Let X ∈ {0, 1} and consider the distributions p(X ) and q(X ) such that:

p(X = 1) = θp p(X = 0) = 1− θp
q(X = 1) = θq q(X = 0) = 1− θq

What distributions are these?

Compute DKL(p‖q) and DKL(q‖p) with θp = 1/2 and θq = 1/4

16 / 31

Relative Entropy
Example (from Cover & Thomas, 2006) — Cont’d

DKL(p‖q) = θp log (θp / θq) + (1 − θp) log ((1 − θp) / (1 − θq))

         = (1/2) log ( (1/2) / (1/4) ) + (1/2) log ( (1/2) / (3/4) )

         = 1 − (1/2) log 3 ≈ 0.2075 bits

DKL(q‖p) = θq log (θq / θp) + (1 − θq) log ((1 − θq) / (1 − θp))

         = (1/4) log ( (1/4) / (1/2) ) + (3/4) log ( (3/4) / (1/2) )

         = −1 + (3/4) log 3 ≈ 0.1887 bits
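The same computation as a short script (a sketch; `kl_bernoulli` is my own helper and assumes 0 < θ < 1):

```python
import math

def kl_bernoulli(a, b):
    """DKL between Bernoulli(a) and Bernoulli(b), in bits. Assumes 0 < a, b < 1."""
    return a * math.log2(a / b) + (1 - a) * math.log2((1 - a) / (1 - b))

theta_p, theta_q = 1/2, 1/4
print(kl_bernoulli(theta_p, theta_q))   # ~0.2075 bits
print(kl_bernoulli(theta_q, theta_p))   # ~0.1887 bits
print(1 - 0.5 * math.log2(3))           # matches the closed form for DKL(p||q)
print(-1 + 0.75 * math.log2(3))         # matches the closed form for DKL(q||p)
```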

17 / 31

1 Decomposability of Entropy

2 Relative Entropy / KL Divergence

3 Mutual Information
Definition
Joint and Conditional Mutual Information

4 Wrapping up

18 / 31

Mutual Information
Definition

Let X ,Y be two r.v. with joint distribution p(X ,Y ) and marginals p(X ) and
p(Y ):

Definition
The mutual information I(X ;Y ) is the relative entropy between the joint
distribution p(X ,Y ) and the product distribution p(X )p(Y ):

I(X;Y) = DKL( p(X,Y) ‖ p(X)p(Y) )

       = Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log ( p(x, y) / (p(x)p(y)) )

Non-negativity: I(X ;Y ) ≥ 0
Symmetry: I(Y ;X ) = I(X ;Y )

Intuitively, how much information, on average, X conveys about Y .
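A sketch that computes I(X;Y) directly from this definition for a small joint table (the table below is an arbitrary example, not from the lecture):

```python
import math

# Joint distribution p(x, y) as a 2D table (rows: x, columns: y).
joint = [[0.30, 0.10],
         [0.10, 0.50]]

px = [sum(row) for row in joint]        # marginal p(x)
py = [sum(col) for col in zip(*joint)]  # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
mi = sum(pxy * math.log2(pxy / (px[i] * py[j]))
         for i, row in enumerate(joint)
         for j, pxy in enumerate(row) if pxy > 0)

print(mi)        # ~0.26 bits shared between X and Y
assert mi >= 0   # non-negativity
```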

19 / 31

Relationship between Entropy and Mutual Information

We can re-write the definition of mutual information as:

I(X;Y) = Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log ( p(x, y) / (p(x)p(y)) )

       = Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log ( p(x|y) / p(x) )

       = − Σ_{x ∈ X} log p(x) Σ_{y ∈ Y} p(x, y) − ( − Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log p(x|y) )

       = H(X) − H(X|Y)

The average reduction in uncertainty of X due to the knowledge of Y .

Self-information: I(X ;X ) = H(X )− H(X |X ) = H(X )
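A numerical check that the KL form and H(X) − H(X|Y) agree, using the same style of joint table (an arbitrary example; helper computations are my own):

```python
import math

joint = [[0.30, 0.10],
         [0.10, 0.50]]                  # p(x, y), rows indexed by x
px = [sum(row) for row in joint]        # p(x)
py = [sum(col) for col in zip(*joint)]  # p(y)

# I(X;Y) from the definition as a KL divergence.
mi_kl = sum(pxy * math.log2(pxy / (px[i] * py[j]))
            for i, row in enumerate(joint) for j, pxy in enumerate(row) if pxy > 0)

# H(X) and H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y)/p(y).
h_x = -sum(p * math.log2(p) for p in px if p > 0)
h_x_given_y = -sum(pxy * math.log2(pxy / py[j])
                   for i, row in enumerate(joint) for j, pxy in enumerate(row) if pxy > 0)

print(mi_kl, h_x - h_x_given_y)   # equal: I(X;Y) = H(X) - H(X|Y)
assert math.isclose(mi_kl, h_x - h_x_given_y)
```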
20 / 31

Mutual Information:
Properties

Mutual Information is non-negative:

I(X ;Y ) ≥ 0

Mutual Information is symmetric:

I(X ;Y ) = I(Y ;X )

Self-information:

I(X ;X ) = H(X )

Since H(X ,Y ) = H(Y ) + H(X |Y ) we have that:

I(X ;Y ) = H(X )− H(X |Y ) = H(X ) + H(Y )− H(X ,Y )

21 / 31

Breakdown of Joint Entropy

[Figure: diagram relating H(X,Y), H(X), H(Y), H(X|Y), H(Y|X) and I(X;Y). From Mackay, p140; see his exercise 8.8]

22 / 31

Mutual Information
Example 1 (from Mackay, 2003)

Let X ,Y ,Z be r.v. with X ,Y ∈ {0, 1}, X ⊥⊥ Y and:

p(X = 0) = p p(X = 1) = 1− p
p(Y = 0) = q p(Y = 1) = 1− q

Z = (X + Y ) mod 2

(a) If q = 1/2, what are p(Z = 0), p(Z = 1), and I(Z;X)?

(b) For general p and q, what are p(Z = 0), p(Z = 1), and I(Z;X)?

23 / 31

Mutual Information
Example 1 (from Mackay, 2003) — Solution (a)

(a) As X ⊥⊥ Y and q = 1/2, the noise will flip the outcome of X with probability 0.5, regardless of the outcome of X. Therefore:

p(Z = 1) = 1/2 p(Z = 0) = 1/2

Hence:
I(X ;Z ) = H(Z )− H(Z |X ) = 1− 1 = 0

Indeed for q = 1/2 we see that Z ⊥⊥ X

24 / 31

Mutual Information
Example 1 (from Mackay, 2003) — Solution (b)

(b)

ℓ := p(Z = 0) = p(X = 0) × p(no flip) + p(X = 1) × p(flip)
   = pq + (1 − p)(1 − q)
   = 1 + 2pq − q − p

Similarly:

p(Z = 1) = p(X = 1) × p(no flip) + p(X = 0) × p(flip)
         = (1 − p)q + p(1 − q)
         = q + p − 2pq

and:

I(Z;X) = H(Z) − H(Z|X)
       = H(ℓ, 1 − ℓ) − H(q, 1 − q)    why?
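A sketch that checks the closed form H(ℓ, 1 − ℓ) − H(q, 1 − q) against a brute-force computation of I(Z;X) from the joint distribution of (X, Z) (the values of p and q, and the helper `h2`, are my own choices):

```python
import math

def h2(t):
    """Binary entropy H(t, 1-t) in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

p, q = 0.3, 0.2   # p(X=0) = p, p(Y=0) = q, with X independent of Y

# Joint p(x, z) with z = (x + y) mod 2: z equals x iff y = 0 (probability q).
joint = {(0, 0): p * q,             (0, 1): p * (1 - q),
         (1, 0): (1 - p) * (1 - q), (1, 1): (1 - p) * q}
pz0 = joint[(0, 0)] + joint[(1, 0)]   # the quantity called ℓ on the slide

# Brute-force I(Z;X) = sum p(x,z) log2( p(x,z) / (p(x) p(z)) ).
px = {0: p, 1: 1 - p}
pz = {0: pz0, 1: 1 - pz0}
mi = sum(pxz * math.log2(pxz / (px[x] * pz[z])) for (x, z), pxz in joint.items() if pxz > 0)

print(mi)                 # brute force
print(h2(pz0) - h2(q))    # closed form H(l, 1-l) - H(q, 1-q): the same value
assert math.isclose(mi, h2(pz0) - h2(q))
```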

25 / 31

1 Decomposability of Entropy

2 Relative Entropy / KL Divergence

3 Mutual Information
Definition
Joint and Conditional Mutual Information

4 Wrapping up

26 / 31

Joint Mutual Information

Recall that for random variables X ,Y ,

I(X ;Y ) = H(X )− H(X |Y )
Reduction in uncertainty in X due to knowledge of Y

More generally, for random variables X1, . . . ,Xn,Y1, . . . ,Ym,

I(X1, . . . ,Xn;Y1, . . . ,Ym) = H(X1, . . . ,Xn)− H(X1, . . . ,Xn|Y1, . . . ,Ym)
Reduction in uncertainty in X1, . . . ,Xn due to knowledge of Y1, . . . ,Ym

Symmetry also generalises:

I(X1, . . . ,Xn;Y1, . . . ,Ym) = I(Y1, . . . ,Ym;X1, . . . ,Xn)

27 / 31

Conditional Mutual Information

The conditional mutual information between X and Y given Z = zk :

I(X ;Y |Z = zk) = H(X |Z = zk)− H(X |Y ,Z = zk).

Averaging over Z we obtain:

The conditional mutual information between X and Y given Z :

I(X;Y|Z) = H(X|Z) − H(X|Y,Z)

         = E_{p(X,Y,Z)} [ log ( p(X,Y|Z) / (p(X|Z) p(Y|Z)) ) ]

The reduction in the uncertainty of X due to the knowledge of Y when Z is
given.

Note that I(X ;Y ;Z ), I(X |Y ;Z ) are illegal terms while
e.g. I(A,B;C,D|E ,F ) is legal.
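A sketch computing I(X;Y|Z) from the expectation form above for three binary variables (the joint table below is an arbitrary example, not from the lecture):

```python
import math
from collections import defaultdict

# An arbitrary joint distribution p(x, y, z) over three binary variables.
joint = {(0, 0, 0): 0.10, (0, 1, 0): 0.15, (1, 0, 0): 0.20, (1, 1, 0): 0.05,
         (0, 0, 1): 0.05, (0, 1, 1): 0.10, (1, 0, 1): 0.05, (1, 1, 1): 0.30}

# Marginals needed for p(x,y|z), p(x|z), p(y|z).
pz, pxz, pyz = defaultdict(float), defaultdict(float), defaultdict(float)
for (x, y, z), pr in joint.items():
    pz[z] += pr
    pxz[(x, z)] += pr
    pyz[(y, z)] += pr

# I(X;Y|Z) = E_{p(x,y,z)} [ log2( p(x,y|z) / (p(x|z) p(y|z)) ) ]
cmi = sum(pr * math.log2((pr / pz[z]) / ((pxz[(x, z)] / pz[z]) * (pyz[(y, z)] / pz[z])))
          for (x, y, z), pr in joint.items() if pr > 0)

print(cmi)               # conditional mutual information in bits
assert cmi >= -1e-12     # non-negative, up to floating-point error
```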

28 / 31

1 Decomposability of Entropy

2 Relative Entropy / KL Divergence

3 Mutual Information
Definition
Joint and Conditional Mutual Information

4 Wrapping up

29 / 31

Summary

Decomposability of entropy

Relative entropy

Mutual information

Reading: Mackay §2.5, Ch 8; Cover & Thomas §2.3 to §2.5

30 / 31

Next time

Mutual information chain rule

Jensen’s inequality

“Information cannot hurt”

Data processing inequality

31 / 31
