COMP2610 / COMP6261 – Information Theory
Lecture 8: Some Fundamental Inequalities
Robert C. Williamson
Research School of Computer Science
14 August 2018
Last time
Decomposability of entropy
Relative entropy (KL divergence)
Mutual information
Review
Relative entropy (KL divergence):

DKL(p‖q) = ∑_{x∈X} p(x) log [p(x)/q(x)]

Mutual information:

I(X;Y) = DKL(p(X,Y) ‖ p(X)p(Y))
       = H(X) + H(Y) − H(X,Y)
       = H(X) − H(X|Y).

Average reduction in uncertainty in X when Y is known
I(X;Y) = 0 when X, Y are statistically independent
Conditional mutual information of X, Y given Z:

I(X;Y|Z) = H(X|Z) − H(X|Y,Z)
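As a quick sanity check, these definitions can be evaluated directly with NumPy. This is a minimal sketch: the joint table p_xy and the helper names entropy and kl are made up for illustration, not part of the lecture.

import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    # D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Hypothetical joint distribution p(X, Y): rows index X, columns index Y.
p_xy = np.array([[1/8, 1/16, 1/4],
                 [1/16, 1/8, 1/16],
                 [1/16, 1/8, 1/8]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# I(X;Y) two ways: via the KL divergence and via entropies.
mi_kl = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())
mi_h = entropy(p_x) + entropy(p_y) - entropy(p_xy)
print(mi_kl, mi_h)   # the two values agree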
This time
Mutual information chain rule
Jensen’s inequality
“Information cannot hurt”
Data processing inequality
Outline
1 Chain Rule for Mutual Information
2 Convex Functions
3 Jensen’s Inequality
4 Gibbs’ Inequality
5 Information Cannot Hurt
6 Data Processing Inequality
7 Wrapping Up
1 Chain Rule for Mutual Information
Recall: Joint Mutual Information
Recall the mutual information between X and Y:

I(X;Y) = H(X) + H(Y) − H(X,Y) = I(Y;X).

We can also compute the mutual information between X1, . . . ,XN and Y1, . . . ,YM:

I(X1, . . . ,XN; Y1, . . . ,YM) = H(X1, . . . ,XN) + H(Y1, . . . ,YM) − H(X1, . . . ,XN, Y1, . . . ,YM)
                               = I(Y1, . . . ,YM; X1, . . . ,XN).

Note that I(X,Y;Z) ≠ I(X;Y,Z) in general:
the reduction in uncertainty of X and Y given Z versus the reduction in uncertainty of X given Y and Z.
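The asymmetry noted above is easy to see numerically. A minimal sketch: the joint distribution below is randomly generated, so the particular values are arbitrary; only the fact that the two quantities differ is the point.

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a (possibly multi-dimensional) distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
p_xyz = rng.random((2, 2, 2))
p_xyz /= p_xyz.sum()          # a random joint p(X, Y, Z)

# I(X,Y ; Z) = H(X,Y) + H(Z) - H(X,Y,Z)
i_xy_z = entropy(p_xyz.sum(axis=2)) + entropy(p_xyz.sum(axis=(0, 1))) - entropy(p_xyz)
# I(X ; Y,Z) = H(X) + H(Y,Z) - H(X,Y,Z)
i_x_yz = entropy(p_xyz.sum(axis=(1, 2))) + entropy(p_xyz.sum(axis=0)) - entropy(p_xyz)
print(i_xy_z, i_x_yz)         # generally two different numbers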
Chain Rule for Mutual Information
Let X, Y, Z be random variables and recall that:

p(Z,Y) = p(Z|Y) p(Y)
H(Z,Y) = H(Z|Y) + H(Y)

Then:

I(X;Y,Z) = I(Y,Z;X)                                (symmetry)
         = H(Z,Y) − H(Z,Y|X)                       (definition of mutual information)
         = H(Z|Y) + H(Y) − H(Z|X,Y) − H(Y|X)       (entropy's chain rule)
         = [H(Y) − H(Y|X)] + [H(Z|Y) − H(Z|X,Y)]
         = I(Y;X) + I(Z;X|Y)                       (definitions of mutual information and conditional mutual information)

so I(X;Y,Z) = I(X;Y) + I(X;Z|Y). Similarly, by symmetry:

I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
Chain Rule for Mutual Information
General form
For any collection of random variables X1, . . . ,XN and Y :
I(X1, . . . ,XN; Y) = I(X1;Y) + I(X2, . . . ,XN; Y|X1)
                    = I(X1;Y) + I(X2;Y|X1) + I(X3, . . . ,XN; Y|X1,X2)
                    = . . .
                    = ∑_{i=1}^{N} I(Xi; Y | X1, . . . ,Xi−1)
                    = ∑_{i=1}^{N} I(Y; Xi | X1, . . . ,Xi−1).
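The chain rule can be checked numerically. The sketch below verifies the three-variable case I(X;Y,Z) = I(X;Y) + I(X;Z|Y); the joint distribution is randomly generated and purely illustrative.

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a (possibly multi-dimensional) distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
p = rng.random((3, 2, 2))
p /= p.sum()                                  # random joint p(X, Y, Z)

p_x, p_y = p.sum(axis=(1, 2)), p.sum(axis=(0, 2))
p_xy, p_yz = p.sum(axis=2), p.sum(axis=0)

i_x_yz = entropy(p_x) + entropy(p_yz) - entropy(p)       # I(X; Y,Z)
i_x_y = entropy(p_x) + entropy(p_y) - entropy(p_xy)      # I(X; Y)
# I(X; Z | Y) = H(X|Y) + H(Z|Y) - H(X,Z|Y) = H(X,Y) + H(Y,Z) - H(Y) - H(X,Y,Z)
i_x_z_given_y = entropy(p_xy) + entropy(p_yz) - entropy(p_y) - entropy(p)

print(i_x_yz, i_x_y + i_x_z_given_y)          # equal up to rounding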
2 Convex Functions
Convex Functions: Introduction

[Figure (MacKay, 2003): a convex function f with a chord between x1 and x2; for 0 ≤ λ ≤ 1, the point x_λ = λx1 + (1 − λ)x2 satisfies f(x_λ) ≤ λf(x1) + (1 − λ)f(x2).]

A function is convex if every chord of the function lies above the function.
Convex and Concave Functions
Definitions
Definition
A function f(x) is convex over (a, b) if for all x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1:

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)

We say f is strictly convex if, for all x1 ≠ x2 in (a, b), equality holds only for λ = 0 and λ = 1.

Similarly, a function f is concave if −f is convex, i.e. if every chord of the function lies below the function.
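The definition can be tested numerically by sampling many (x1, x2, λ) triples and checking the chord inequality. A minimal sketch, with a hypothetical helper satisfies_convexity and f(x) = x² as the example (any other candidate function can be substituted):

import numpy as np

def satisfies_convexity(f, lo=-5.0, hi=5.0, trials=100_000, seed=0):
    # Randomly test f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2).
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(lo, hi, trials)
    x2 = rng.uniform(lo, hi, trials)
    lam = rng.uniform(0.0, 1.0, trials)
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    return np.all(lhs <= rhs + 1e-12)   # small tolerance for rounding

print(satisfies_convexity(lambda x: x**2))    # True: x^2 is convex
print(satisfies_convexity(lambda x: -x**2))   # almost surely False: concave, not convex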
Examples of Convex and Concave Functions
[Plots: the convex functions x², eˣ and x log x, and the concave functions −x², log x and √x.]
Verifying Convexity
Theorem (Cover & Thomas, Th 2.6.1)
If a function f has a second derivative that is non-negative (positive) over an interval, then f is convex (strictly convex) over that interval.
This allows us to verify convexity or concavity.
Examples:

x²:          d²/dx² (x²) = d/dx (2x) = 2

eˣ:          d²/dx² (eˣ) = d/dx (eˣ) = eˣ

√x, x > 0:   d²/dx² (√x) = (1/2) d/dx (1/√x) = −(1/4) · 1/√x³
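These second derivatives can also be checked symbolically; a short sketch, assuming SymPy is available (not part of the lecture material):

import sympy as sp

x = sp.symbols('x', positive=True)

# Second derivatives of the three examples above.
print(sp.diff(x**2, x, 2))        # 2               -> convex everywhere
print(sp.diff(sp.exp(x), x, 2))   # exp(x)          -> convex everywhere
print(sp.diff(sp.sqrt(x), x, 2))  # -1/(4*x**(3/2)) -> negative, so sqrt(x) is concave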
Convexity, Concavity and Optimization
If f(x) is concave and there exists a point at which

df/dx = 0,

then f(x) has a maximum at that point.

Note: the converse does not hold: if a concave f(x) is maximized at some x, it is not necessarily true that the derivative is zero there.

f(x) = −|x| is maximized at x = 0, where its derivative is undefined
f(p) = log p with 0 ≤ p ≤ 1 is maximized at p = 1, where df/dp = 1

Similarly for minimisation of convex functions.
3 Jensen's Inequality
Jensen’s Inequality for Convex Functions
Theorem: Jensen’s Inequality
If f is a convex function and X is a random variable then:

f(E[X]) ≤ E[f(X)].

Moreover, if f is strictly convex, equality implies that X = E[X] with probability 1, i.e. X is a constant.

In other words, for a probability vector p,

f( ∑_{i=1}^{N} pi xi ) ≤ ∑_{i=1}^{N} pi f(xi).

Similarly, for a concave function: E[f(X)] ≤ f(E[X]).
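Jensen's inequality is easy to probe numerically. In the sketch below, the probability vector and support points are randomly generated (purely illustrative), and f(E[X]) is compared with E[f(X)] for the convex function f(x) = x².

import numpy as np

rng = np.random.default_rng(0)
N = 10
p = rng.random(N); p /= p.sum()        # random probability vector
x = rng.uniform(-5, 5, N)              # support points of X
f = lambda t: t**2                     # a convex function

lhs = f(np.sum(p * x))                 # f(E[X])
rhs = np.sum(p * f(x))                 # E[f(X)]
print(lhs <= rhs)                      # True (equality only if X is constant)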
Jensen’s Inequality for Convex Functions
Proof by Induction
(1) K = 2: For a two-state random variable X ∈ {x1, x2} with p = (p1, p2) = (p1, 1 − p1) and 0 ≤ p1 ≤ 1, we simply follow the definition of convexity:

E[f(X)] = p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2) = f(E[X])
Jensen’s Inequality for Convex Functions
Proof by Induction — Cont’d
(2) (K − 1) → K: Assume the theorem is true for distributions with K − 1 states, and write p′i = pi/(1 − pK) for i = 1, . . . ,K − 1. Then:

∑_{i=1}^{K} pi f(xi) = pK f(xK) + (1 − pK) ∑_{i=1}^{K−1} p′i f(xi)
                     ≥ pK f(xK) + (1 − pK) f( ∑_{i=1}^{K−1} p′i xi )     (induction hypothesis)
                     ≥ f( pK xK + (1 − pK) ∑_{i=1}^{K−1} p′i xi )        (definition of convexity)
                     = f( ∑_{i=1}^{K} pi xi )

Hence ∑_{i=1}^{K} pi f(xi) ≥ f( ∑_{i=1}^{K} pi xi ), i.e. E[f(X)] ≥ f(E[X]). (Equality case?)
Jensen’s Inequality Example: The AM-GM Inequality
Recall that for a concave function: E[f(X)] ≤ f(E[X]).

Consider X ∈ {x1, . . . , xN} with each xi > 0, uniform probability distribution p = (1/N, . . . , 1/N), and the strictly concave function f(x) = log x:

(1/N) ∑_{i=1}^{N} log xi ≤ log( (1/N) ∑_{i=1}^{N} xi )

log( ∏_{i=1}^{N} xi )^{1/N} ≤ log( (1/N) ∑_{i=1}^{N} xi )

( ∏_{i=1}^{N} xi )^{1/N} ≤ (1/N) ∑_{i=1}^{N} xi

i.e. the geometric mean of x1, . . . , xN is at most their arithmetic mean:

(x1 x2 · · · xN)^{1/N} ≤ (x1 + x2 + · · · + xN)/N
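The AM–GM inequality derived above can be spot-checked as follows; the sample values are arbitrary, and the logarithm trick simply recomputes the geometric mean in a numerically stable way.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=8)            # arbitrary positive numbers

arithmetic_mean = x.mean()
geometric_mean = np.exp(np.mean(np.log(x)))   # stable form of (prod x_i)^(1/N)

print(geometric_mean <= arithmetic_mean)      # True; equal only if all x_i coincide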
4 Gibbs' Inequality
Gibbs’ Inequality
Theorem
The relative entropy (or KL divergence) between two distributions p(X) and q(X) with X ∈ X is non-negative:

DKL(p‖q) ≥ 0

with equality if and only if p(x) = q(x) for all x.
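Before the proof, a quick numerical illustration: for randomly generated distributions (a sketch, with arbitrary values), DKL(p‖q) is never negative, and it is zero when p = q.

import numpy as np

def kl(p, q):
    # D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.random(6); p /= p.sum()
    q = rng.random(6); q /= q.sum()
    assert kl(p, q) >= 0              # Gibbs' inequality
print(kl(p, p))                       # 0.0: equality when the distributions match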
Gibbs’ Inequality
Proof (1 of 2)
Recall that:

DKL(p‖q) = ∑_{x∈X} p(x) log [p(x)/q(x)] = E_{p(X)}[ log (p(X)/q(X)) ]

Let A = {x : p(x) > 0}. Then:

−DKL(p‖q) = ∑_{x∈A} p(x) log [q(x)/p(x)]
          ≤ log ∑_{x∈A} p(x) · q(x)/p(x)     (Jensen's inequality)
          = log ∑_{x∈A} q(x)
          ≤ log ∑_{x∈X} q(x)
          = log 1
          = 0
Gibbs’ Inequality
Proof (2 of 2)
Since log u is strictly concave, the first inequality (Jensen) holds with equality only if

q(x)/p(x) = c for all x ∈ A, for some constant c. Then:

∑_{x∈A} q(x) = c ∑_{x∈A} p(x) = c.

Also, the last inequality on the previous slide becomes an equality only if

∑_{x∈A} q(x) = ∑_{x∈X} q(x).

Therefore c = 1, and DKL(p‖q) = 0 ⇔ p(x) = q(x) for all x.

Alternative proof: use the fact that log x ≤ x − 1 (for the natural logarithm).
Non-Negativity of Mutual Information
Corollary
For any two random variables X, Y:

I(X;Y) ≥ 0,

with equality if and only if X and Y are statistically independent.

Proof: We simply use the definition of mutual information and Gibbs' inequality:

I(X;Y) = DKL(p(X,Y) ‖ p(X)p(Y)) ≥ 0,

with equality if and only if p(X,Y) = p(X)p(Y), i.e. X and Y are independent.
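Correspondingly, when a joint distribution is exactly the product of its marginals, the mutual information computed from the definition is zero. A small sketch (the marginals chosen here are arbitrary):

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a (possibly multi-dimensional) distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = np.array([0.2, 0.5, 0.3])
p_y = np.array([0.6, 0.4])
p_xy = np.outer(p_x, p_y)             # independent joint: p(x, y) = p(x) p(y)

mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)
print(round(mi, 12))                  # 0.0 up to floating-point rounding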
5 Information Cannot Hurt
Conditioning Reduces Entropy
Information Cannot Hurt — Proof
Theorem
For any two random variables X, Y,

H(X|Y) ≤ H(X),

with equality if and only if X and Y are independent.

Proof: We simply use the non-negativity of mutual information:

I(X;Y) ≥ 0
H(X) − H(X|Y) ≥ 0
H(X|Y) ≤ H(X)

with equality if and only if p(X,Y) = p(X)p(Y), i.e. X and Y are independent.
Data are helpful: on average, they do not increase uncertainty.
Conditioning Reduces Entropy
Information Cannot Hurt — Example (from Cover & Thomas, 2006)
Let X, Y have the following joint distribution:

p(X,Y)       X = 1    X = 2
  Y = 1        0       3/4
  Y = 2       1/8      1/8

p(X) = (1/8, 7/8)
p(Y) = (3/4, 1/4)
p(X|Y = 1) = (0, 1)
p(X|Y = 2) = (1/2, 1/2)

H(X) ≈ 0.544 bits,   H(X|Y = 1) = 0 bits,   H(X|Y = 2) = 1 bit

We see that in this case H(X|Y = 1) < H(X) but H(X|Y = 2) > H(X). However,

H(X|Y) = ∑_{y∈{1,2}} p(y) H(X|Y = y) = (3/4)·0 + (1/4)·1 = 0.25 bits < H(X)

H(X|Y = y) may be greater than H(X), but the average H(X|Y) is always less than or equal to H(X).

Information cannot hurt on average.
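The numbers in this example can be reproduced directly; a short sketch of that computation (the helper entropy is an illustrative name, not from the lecture):

import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution from the example: rows are Y in {1, 2}, columns are X in {1, 2}.
p_yx = np.array([[0.0, 3/4],
                 [1/8, 1/8]])
p_x = p_yx.sum(axis=0)                     # (1/8, 7/8)
p_y = p_yx.sum(axis=1)                     # (3/4, 1/4)

print(entropy(p_x))                        # H(X)   ~ 0.544 bits
h_x_given_y = sum(p_y[i] * entropy(p_yx[i] / p_y[i]) for i in range(2))
print(h_x_given_y)                         # H(X|Y) = 0.25 bits  (3/4 * 0 + 1/4 * 1)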
6 Data Processing Inequality
Markov Chain
X → Y → Z

Definition
Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if their joint probability distribution can be written as:

p(X,Y,Z) = p(X) p(Y|X) p(Z|Y)

Consequences:
X → Y → Z if and only if X and Z are conditionally independent given Y.
X → Y → Z implies that Z → Y → X.
If Z = f(Y), then X → Y → Z.
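The defining property can be checked numerically. The sketch below builds an arbitrary joint of the form p(x)p(y|x)p(z|y) from randomly generated factors (purely illustrative) and confirms that X and Z are conditionally independent given Y, i.e. I(X;Z|Y) = 0.

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a (possibly multi-dimensional) distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
p_x = rng.random(3); p_x /= p_x.sum()                    # p(x)
p_y_given_x = rng.random((3, 4)); p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)
p_z_given_y = rng.random((4, 2)); p_z_given_y /= p_z_given_y.sum(axis=1, keepdims=True)

# Joint p(x, y, z) = p(x) p(y|x) p(z|y) -- a Markov chain X -> Y -> Z.
p_xyz = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

# I(X; Z | Y) = H(X,Y) + H(Y,Z) - H(Y) - H(X,Y,Z)
p_xy, p_yz, p_y = p_xyz.sum(axis=2), p_xyz.sum(axis=0), p_xyz.sum(axis=(0, 2))
i_x_z_given_y = entropy(p_xy) + entropy(p_yz) - entropy(p_y) - entropy(p_xyz)
print(round(i_x_z_given_y, 12))                          # 0.0: conditional independence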
Data-Processing Inequality
Definition
Theorem
If X → Y → Z, then I(X;Y) ≥ I(X;Z).
X is the state of the world, Y is the data gathered and Z is the
processed data
No “clever” manipulation of the data can improve the inferences that
can be made from the data
No processing of Y , deterministic or random, can increase the
information that Y contains about X
Data-Processing Inequality
Proof
Recall that the chain rule for mutual information states that:

I(X;Y,Z) = I(X;Y) + I(X;Z|Y)
         = I(X;Z) + I(X;Y|Z)

Therefore:

I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)     where I(X;Z|Y) = 0 by the Markov chain assumption
I(X;Y) = I(X;Z) + I(X;Y|Z)                but I(X;Y|Z) ≥ 0
I(X;Y) ≥ I(X;Z)
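A numerical sanity check of the theorem, using the same kind of randomly constructed Markov chain as in the earlier sketch (all distributions arbitrary): the processed variable Z carries no more information about X than Y does. A deterministic Z = g(Y) is the special case where p_z_given_y has a single 1 in each row.

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a (possibly multi-dimensional) distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_ab):
    # I(A;B) = H(A) + H(B) - H(A,B) for a joint table p_ab[a, b].
    return entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)

rng = np.random.default_rng(1)
p_x = rng.random(3); p_x /= p_x.sum()
p_y_given_x = rng.random((3, 4)); p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)
p_z_given_y = rng.random((4, 2)); p_z_given_y /= p_z_given_y.sum(axis=1, keepdims=True)

p_xyz = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]
i_xy = mutual_information(p_xyz.sum(axis=2))   # I(X;Y)
i_xz = mutual_information(p_xyz.sum(axis=1))   # I(X;Z)
print(i_xy >= i_xz, i_xy, i_xz)                # True: data processing inequality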
Data-Processing Inequality
Functions of the Data
Corollary
In particular, if Z = g(Y) we have that:

I(X;Y) ≥ I(X; g(Y))

Proof: X → Y → g(Y) forms a Markov chain.

Functions of the data Y cannot increase the information about X.
Data-Processing Inequality
Observation of a “Downstream” Variable
Corollary
If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).
Proof: We again use the chain rule for mutual information:

I(X;Y,Z) = I(X;Y) + I(X;Z|Y)
         = I(X;Z) + I(X;Y|Z)

Therefore:

I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)     where I(X;Z|Y) = 0 by the Markov chain assumption
I(X;Y|Z) = I(X;Y) − I(X;Z)                but I(X;Z) ≥ 0
I(X;Y|Z) ≤ I(X;Y)

The dependence between X and Y cannot be increased by the observation of a "downstream" variable.
7 Wrapping Up
Summary & Conclusions
Chain rule for mutual information
Convex Functions
Jensen’s inequality, Gibbs’ inequality
Important inequalities regarding information, inference and data
processing
Reading: MacKay §2.6 to §2.10; Cover & Thomas §2.5 to §2.8.
Next time
Law of large numbers
Markov’s inequality
Chebyshev's inequality