COMP2610 / COMP6261 – Information Theory
Lecture 7: Relative Entropy and Mutual Information
Robert C. Williamson
Research School of Computer Science
13 August 2018
1 / 31
Last time
Information content and entropy: definition and computation
Entropy and average code length
Entropy and minimum expected number of binary questions
Joint and conditional entropies, chain rule
2 / 31
Information Content: Review
Let X be a random variable with outcomes in X
Let p(x) denote the probability of the outcome x ∈ X
The (Shannon) information content of outcome x is
h(x) = log_2 (1/p(x))
As p(x) → 0, h(x) → +∞ (rare outcomes are more informative)
3 / 31
Entropy: Review
The entropy is the average information content of all outcomes:
H(X) = ∑_x p(x) log_2 (1/p(x))
Entropy is minimised if p is peaked, and maximised if p is uniform:
0 ≤ H(X) ≤ log |X|
Entropy is related to minimal number of bits needed to describe a random
variable
4 / 31
This time
The decomposability property of entropy
Relative entropy and divergences
Mutual information
5 / 31
Outline
1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information
Definition
Joint and Conditional Mutual Information
4 Wrapping up
6 / 31
Decomposability of Entropy
Example 1 (Mackay, 2003)
Let X ∈ {1, 2, 3} be a r.v. created by the following process:
1 Flip a fair coin to determine whether X = 1
2 If X ≠ 1 flip another fair coin to determine whether X = 2 or X = 3
The probability distribution of X is given by:
p(X = 1) = 1/2, p(X = 2) = 1/4, p(X = 3) = 1/4
7 / 31
Decomposability of Entropy
Example 1 (Mackay, 2003) — Cont’d
By definition,
H(X) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits.
But imagine learning the value of X gradually:
1 First we learn whether X = 1:
  - A binary variable with probabilities (1/2, 1/2)
  - Hence H(1/2, 1/2) = log_2 2 = 1 bit.
2 If X ≠ 1 we learn the value of the second coin flip:
  - Also a binary variable with probabilities (1/2, 1/2)
  - Therefore H(1/2, 1/2) = 1 bit.
However, the second revelation only happens half of the time:
H(X) = H(1/2, 1/2) + (1/2) H(1/2, 1/2) = 1.5 bits.
8 / 31
Decomposability of Entropy
Generalization
For a r.v. with probability distribution p = (p_1, …, p_{|X|}):
H(p) = H(p_1, 1 − p_1) + (1 − p_1) H(p_2/(1 − p_1), …, p_{|X|}/(1 − p_1))
H(p_1, 1 − p_1): entropy for a random variable corresponding to “Is X = 1?”
1 − p_1: probability of X ≠ 1
p_2/(1 − p_1), …, p_{|X|}/(1 − p_1): conditional probabilities of X = 2, …, |X| given X ≠ 1.
H(p_2/(1 − p_1), …, p_{|X|}/(1 − p_1)): entropy for a random variable corresponding to the outcomes when X ≠ 1.
9 / 31
Decomposability of Entropy
Generalization
In general, we have that for any m between 1 and |X| − 1:
H(p) = H(∑_{i=1}^{m} p_i, ∑_{i=m+1}^{|X|} p_i)
       + (∑_{i=1}^{m} p_i) · H(p_1 / ∑_{i=1}^{m} p_i, …, p_m / ∑_{i=1}^{m} p_i)
       + (∑_{i=m+1}^{|X|} p_i) · H(p_{m+1} / ∑_{i=m+1}^{|X|} p_i, …, p_{|X|} / ∑_{i=m+1}^{|X|} p_i)
Apply this formula with m = 1, |X | = 3, p = (p1, p2, p3) = (1/2, 1/4, 1/4)
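As a quick numerical check of this decomposition (a sketch of mine, not part of the slides; the helper name entropy is my own):

import math

def entropy(probs):
    # Shannon entropy in bits; terms with p = 0 contribute 0 by convention
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

p = [0.5, 0.25, 0.25]
m = 1
head, tail = sum(p[:m]), sum(p[m:])

direct = entropy(p)
decomposed = (entropy([head, tail])
              + head * entropy([pi / head for pi in p[:m]])
              + tail * entropy([pi / tail for pi in p[m:]]))

print(direct, decomposed)  # both print 1.5, as in Example 1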
10 / 31
1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information
Definition
Joint and Conditional Mutual Information
4 Wrapping up
11 / 31
Entropy in Information Theory
If a random variable has distribution p, there exists an encoding with an
average length of
H(p) bits
and this is the “best” possible encoding
What happens if we use a “wrong” encoding?
e.g. because we make an incorrect assumption on the probability distribution
If the true distribution is p, but we assume it is q, it turns out we will need
to use
H(p) + DKL(p‖q) bits
where DKL(p‖q) is some measure of “distance” between p and q
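A small illustration of this claim (my own sketch; the distributions p and q below are arbitrary examples): coding as if the distribution were q gives an expected length of ∑_x p(x) log_2 (1/q(x)) bits, which splits into H(p) + DKL(p‖q).

import math

p = [0.5, 0.25, 0.25]       # true distribution (example)
q = [1/3, 1/3, 1/3]         # assumed (wrong) distribution (example)

H_p = sum(pi * math.log2(1 / pi) for pi in p)
expected_length = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(expected_length, H_p + kl)  # equal: H(p) + DKL(p‖q)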
12 / 31
Relative Entropy
Definition
The relative entropy or Kullback-Leibler (KL) divergence between two
probability distributions p(X ) and q(X ) is defined as:
DKL(p‖q) = ∑_{x∈X} p(x) (log (1/q(x)) − log (1/p(x)))
         = ∑_{x∈X} p(x) log (p(x)/q(x))
         = E_{p(X)}[log (p(X)/q(X))].
Note:
  - Both p(X) and q(X) are defined over the same alphabet X
Conventions on the log likelihood ratio:
0 log (0/0) := 0      0 log (0/q) := 0      p log (p/0) := ∞
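A minimal implementation sketch of this definition (the function name is my own); it follows the conventions above, returning infinity only when p(x) > 0 but q(x) = 0:

import math

def kl_divergence(p, q):
    # DKL(p‖q) in bits, for distributions given as aligned lists of probabilities
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue              # 0 log (0/q) := 0 and 0 log (0/0) := 0
        if qx == 0:
            return math.inf       # p log (p/0) := ∞
        total += px * math.log2(px / qx)
    return total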
13 / 31
Relative Entropy
Properties
DKL(p‖q) ≥ 0 (proof next lecture)
DKL(p‖q) = 0 ⇔ p = q (proof next lecture)
Not symmetric: DKL(p‖q) ≠ DKL(q‖p) in general
Does not satisfy the triangle inequality: in general DKL(p‖q) can exceed DKL(p‖r) + DKL(r‖q)
  - Not a true distance, since it is not symmetric and does not satisfy the triangle inequality
  - Hence “KL divergence” rather than “KL distance”
14 / 31
Relative Entropy
Uniform q
Let q correspond to a uniform distribution: q(x) = 1/|X|
Relative entropy between p and q:
DKL(p‖q) = ∑_{x∈X} p(x) log (p(x)/q(x))
         = ∑_{x∈X} p(x) · (log p(x) + log |X|)
         = −H(X) + ∑_{x∈X} p(x) · log |X|
         = −H(X) + log |X|.
This matches the intuition that DKL(p‖q) is the penalty, in extra bits, for encoding X as if it were uniform.
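A quick numerical check of this identity (a sketch; the distribution p below is an arbitrary example of mine):

import math

p = [0.4, 0.3, 0.2, 0.1]                 # example p over |X| = 4 outcomes
uniform = [1 / len(p)] * len(p)

H_p = sum(px * math.log2(1 / px) for px in p)
kl = sum(px * math.log2(px / qx) for px, qx in zip(p, uniform))

print(kl, math.log2(len(p)) - H_p)       # both ≈ 0.154 bits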
15 / 31
Relative Entropy
Example (from Cover & Thomas, 2006)
Let X ∈ {0, 1} and consider the distributions p(X ) and q(X ) such that:
p(X = 1) = θp p(X = 0) = 1− θp
q(X = 1) = θq q(X = 0) = 1− θq
What distributions are these?
Compute DKL(p‖q) and DKL(q‖p) with θ_p = 1/2 and θ_q = 1/4
16 / 31
Relative Entropy
Example (from Cover & Thomas, 2006) — Cont’d
DKL(p‖q) = θ_p log (θ_p/θ_q) + (1 − θ_p) log ((1 − θ_p)/(1 − θ_q))
         = (1/2) log ((1/2)/(1/4)) + (1/2) log ((1/2)/(3/4))
         = 1 − (1/2) log 3 ≈ 0.2075 bits

DKL(q‖p) = θ_q log (θ_q/θ_p) + (1 − θ_q) log ((1 − θ_q)/(1 − θ_p))
         = (1/4) log ((1/4)/(1/2)) + (3/4) log ((3/4)/(1/2))
         = −1 + (3/4) log 3 ≈ 0.1887 bits
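These two values can be verified numerically, and the asymmetry is plain to see (a sketch using a helper of my own):

import math

def bernoulli_kl(a, b):
    # DKL between Bernoulli(a) and Bernoulli(b), in bits
    return a * math.log2(a / b) + (1 - a) * math.log2((1 - a) / (1 - b))

print(bernoulli_kl(0.5, 0.25))   # ≈ 0.2075 bits
print(bernoulli_kl(0.25, 0.5))   # ≈ 0.1887 bits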
17 / 31
1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information
Definition
Joint and Conditional Mutual Information
4 Wrapping up
18 / 31
Mutual Information
Definition
Let X ,Y be two r.v. with joint distribution p(X ,Y ) and marginals p(X ) and
p(Y ):
Definition
The mutual information I(X ;Y ) is the relative entropy between the joint
distribution p(X ,Y ) and the product distribution p(X )p(Y ):
I(X;Y) = DKL(p(X,Y)‖p(X)p(Y))
       = ∑_{x∈X} ∑_{y∈Y} p(x,y) log (p(x,y)/(p(x)p(y)))
Non-negativity: I(X ;Y ) ≥ 0
Symmetry: I(Y ;X ) = I(X ;Y )
Intuitively, how much information, on average, X conveys about Y .
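A minimal sketch (not from the slides) of computing I(X;Y) directly from this definition; the joint distribution below is an arbitrary example of mine:

import math

# joint[x][y] = p(x, y); an arbitrary example joint distribution
joint = [[0.30, 0.10],
         [0.20, 0.40]]

px = [sum(row) for row in joint]            # marginal p(x)
py = [sum(col) for col in zip(*joint)]      # marginal p(y)

mi = sum(pxy * math.log2(pxy / (px[i] * py[j]))
         for i, row in enumerate(joint)
         for j, pxy in enumerate(row) if pxy > 0)

print(mi)   # ≈ 0.125 bits for this example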
19 / 31
Relationship between Entropy and Mutual Information
We can re-write the definition of mutual information as:
I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x,y) log (p(x,y)/(p(x)p(y)))
       = ∑_{x∈X} ∑_{y∈Y} p(x,y) log (p(x|y)/p(x))
       = −∑_{x∈X} log p(x) ∑_{y∈Y} p(x,y) − (−∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x|y))
       = H(X) − H(X|Y)
The average reduction in uncertainty of X due to the knowledge of Y .
Self-information: I(X ;X ) = H(X )− H(X |X ) = H(X )
20 / 31
Mutual Information:
Properties
Mutual Information is non-negative:
I(X ;Y ) ≥ 0
Mutual Information is symmetric:
I(X ;Y ) = I(Y ;X )
Self-information:
I(X ;X ) = H(X )
Since H(X ,Y ) = H(Y ) + H(X |Y ) we have that:
I(X ;Y ) = H(X )− H(X |Y ) = H(X ) + H(Y )− H(X ,Y )
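These identities can be checked numerically on a small joint table (again an arbitrary example distribution of mine; H is a helper I define):

import math

def H(probs):
    # Shannon entropy in bits
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

joint = [[0.30, 0.10],
         [0.20, 0.40]]
flat = [p for row in joint for p in row]    # p(x, y) as a flat list
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

mi = sum(pxy * math.log2(pxy / (px[i] * py[j]))
         for i, row in enumerate(joint) for j, pxy in enumerate(row) if pxy > 0)

print(mi, H(px) + H(py) - H(flat))   # I(X;Y) = H(X) + H(Y) - H(X,Y)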
21 / 31
Breakdown of Joint Entropy
(From Mackay, p140; see his exercise 8.8)
22 / 31
Mutual Information
Example 1 (from Mackay, 2003)
Let X ,Y ,Z be r.v. with X ,Y ∈ {0, 1}, X ⊥⊥ Y and:
p(X = 0) = p p(X = 1) = 1− p
p(Y = 0) = q p(Y = 1) = 1− q
Z = (X + Y ) mod 2
(a) If q = 1/2, what are P(Z = 0), P(Z = 1) and I(Z;X)?
(b) For general p and q, what are P(Z = 0), P(Z = 1) and I(Z;X)?
23 / 31
Mutual Information
Example 1 (from Mackay, 2003) — Solution (a)
(a) As X ⊥⊥ Y and q = 1/2, the value of Y flips the outcome of X with
probability 1/2, regardless of what that outcome is. Therefore:
p(Z = 1) = 1/2 p(Z = 0) = 1/2
Hence:
I(X ;Z ) = H(Z )− H(Z |X ) = 1− 1 = 0
Indeed for q = 1/2 we see that Z ⊥⊥ X
24 / 31
Mutual Information
Example 1 (from Mackay, 2003) — Solution (b)
(b) Define ℓ := p(Z = 0) = p(X = 0) × p(no flip) + p(X = 1) × p(flip)
              = pq + (1 − p)(1 − q)
              = 1 + 2pq − q − p
Similarly:
p(Z = 1) = p(X = 1) × p(no flip) + p(X = 0) × p(flip)
         = (1 − p)q + p(1 − q)
         = q + p − 2pq
and:
I(Z;X) = H(Z) − H(Z|X)
       = H(ℓ, 1 − ℓ) − H(q, 1 − q)    why?
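A numerical check of this answer (my own sketch; the particular p and q are arbitrary, and H2 denotes the binary entropy):

import math

def H2(a):
    # binary entropy in bits
    return 0.0 if a in (0.0, 1.0) else -a * math.log2(a) - (1 - a) * math.log2(1 - a)

p, q = 0.3, 0.1                      # example values for p(X = 0) and p(Y = 0)
ell = p * q + (1 - p) * (1 - q)      # p(Z = 0) = ℓ

# Given X, Z is a deterministic relabelling of Y, so H(Z|X) = H2(q); hence
# I(Z;X) = H(Z) - H(Z|X) = H2(ell) - H2(q).  With q = 0.5 this gives 0, matching part (a).
print(H2(ell) - H2(q))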
25 / 31
1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information
Definition
Joint and Conditional Mutual Information
4 Wrapping up
26 / 31
Joint Mutual Information
Recall that for random variables X ,Y ,
I(X ;Y ) = H(X )− H(X |Y )
Reduction in uncertainty in X due to knowledge of Y
More generally, for random variables X1, . . . ,Xn,Y1, . . . ,Ym,
I(X1, . . . ,Xn;Y1, . . . ,Ym) = H(X1, . . . ,Xn)− H(X1, . . . ,Xn|Y1, . . . ,Ym)
Reduction in uncertainty in X1, . . . ,Xn due to knowledge of Y1, . . . ,Ym
Symmetry also generalises:
I(X1, . . . ,Xn;Y1, . . . ,Ym) = I(Y1, . . . ,Ym;X1, . . . ,Xn)
27 / 31
Conditional Mutual Information
The conditional mutual information between X and Y given Z = zk :
I(X ;Y |Z = zk) = H(X |Z = zk)− H(X |Y ,Z = zk).
Averaging over Z we obtain:
The conditional mutual information between X and Y given Z :
I(X;Y|Z) = H(X|Z) − H(X|Y,Z)
         = E_{p(X,Y,Z)}[log (p(X,Y|Z)/(p(X|Z)p(Y|Z)))]
The reduction in the uncertainty of X due to the knowledge of Y when Z is
given.
Note that I(X;Y;Z) and I(X|Y;Z) are not legal expressions, whereas
e.g. I(A,B;C,D|E,F) is legal.
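A sketch (not from the slides) of computing I(X;Y|Z) from a full joint table p(x, y, z) via H(X|Z) − H(X|Y,Z); the joint distribution and the helper H_cond are my own illustration:

import math
from collections import defaultdict

# joint[(x, y, z)] = p(x, y, z); an arbitrary example over three binary variables
joint = {(0,0,0): 0.10, (0,0,1): 0.15, (0,1,0): 0.05, (0,1,1): 0.20,
         (1,0,0): 0.15, (1,0,1): 0.05, (1,1,0): 0.20, (1,1,1): 0.10}

def H_cond(joint, target, given):
    # H(target | given) = -sum over (t, g) of p(t, g) log2 p(t | g),
    # marginalising out all other variables
    pg, ptg = defaultdict(float), defaultdict(float)
    for outcome, prob in joint.items():
        g = tuple(outcome[i] for i in given)
        t = tuple(outcome[i] for i in target)
        pg[g] += prob
        ptg[(t, g)] += prob
    return -sum(p * math.log2(p / pg[g]) for (t, g), p in ptg.items() if p > 0)

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z), with variable indices X = 0, Y = 1, Z = 2
print(H_cond(joint, [0], [2]) - H_cond(joint, [0], [1, 2]))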
28 / 31
1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information
Definition
Joint and Conditional Mutual Information
4 Wrapping up
29 / 31
Summary
Decomposability of entropy
Relative entropy
Mutual information
Reading: Mackay §2.5, Ch 8; Cover & Thomas §2.3 to §2.5
30 / 31
Next time
Mutual information chain rule
Jensen’s inequality
“Information cannot hurt”
Data processing inequality
31 / 31