COMP2610 / COMP6261 – Information Theory
Lecture 8: Some Fundamental Inequalities

Robert C. Williamson

Research School of Computer Science

The Australian National University

14 August 2018

1 / 37

Last time

Decomposability of entropy

Relative entropy (KL divergence)

Mutual information

2 / 37

Review

Relative entropy (KL divergence):

DKL(p‖q) = ∑_{x∈X} p(x) log (p(x)/q(x))

Mutual information:

I(X ;Y ) = DKL (p(X ,Y )‖p(X )p(Y ))
= H(X ) + H(Y )− H(X ,Y )
= H(X )− H(X |Y ).

Average reduction in uncertainty in X when Y is known
I(X ;Y ) = 0 when X ,Y statistically independent

Conditional mutual information of X ,Y given Z :

I(X ;Y |Z ) = H(X |Z )− H(X |Y ,Z )
3 / 37
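A minimal numerical sketch of these identities (assuming NumPy is available; the
joint distribution below is an arbitrary illustrative choice):

import numpy as np

def H(p):
    """Shannon entropy in bits of a probability vector p (with 0 log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D_KL(p, q):
    """Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

pXY = np.array([[0.10, 0.20, 0.05],          # arbitrary joint p(x, y)
                [0.25, 0.10, 0.30]])
pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)    # marginals p(x), p(y)

I_as_KL = D_KL(pXY.ravel(), np.outer(pX, pY).ravel())        # DKL(p(X,Y) || p(X)p(Y))
I_as_H = H(pX) + H(pY) - H(pXY.ravel())                       # H(X) + H(Y) - H(X,Y)
H_X_given_Y = sum(pY[y] * H(pXY[:, y] / pY[y]) for y in range(pXY.shape[1]))
I_as_cond = H(pX) - H_X_given_Y                               # H(X) - H(X|Y)

print(I_as_KL, I_as_H, I_as_cond)   # all three agree up to floating-point error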

This time

Mutual information chain rule

Jensen’s inequality

“Information cannot hurt”

Data processing inequality

4 / 37

Outline

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen’s Inequality

4 Gibbs’ Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

5 / 37

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen’s Inequality

4 Gibbs’ Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

6 / 37

Recall: Joint Mutual Information

Recall the mutual information between X and Y :

I(X ;Y ) = H(X ) + H(Y )− H(X ,Y ) = I(Y ;X ).

We can also compute the mutual information between X1, . . . ,XN and
Y1, . . . ,YM :

I(X1, . . . , XN; Y1, . . . , YM) = H(X1, . . . , XN) + H(Y1, . . . , YM) − H(X1, . . . , XN, Y1, . . . , YM)
                                  = I(Y1, . . . , YM; X1, . . . , XN).

Note that I(X,Y;Z) ≠ I(X;Y,Z) in general:
Reduction in uncertainty of X and Y given Z versus reduction in uncertainty
of X given Y and Z

7 / 37

Chain Rule for Mutual Information

Let X ,Y ,Z be r.v. and recall that:

p(Z ,Y ) = p(Z |Y )p(Y )
H(Z ,Y ) = H(Z |Y ) + H(Y )

I(X;Y,Z) = I(Y,Z;X)                                    symmetry
         = H(Z,Y) − H(Z,Y|X)                           definition of mutual info.
         = H(Z|Y) + H(Y) − H(Z|X,Y) − H(Y|X)           entropy's chain rule
         = [H(Y) − H(Y|X)] + [H(Z|Y) − H(Z|X,Y)]
         = I(Y;X) + I(Z;X|Y)

I(X;Y,Z) = I(X;Y) + I(X;Z|Y)                           definitions of mutual info and conditional mutual info

Similarly, by symmetry:

I(X;Y,Z) = I(X;Z) + I(X;Y|Z)

8 / 37

Chain Rule for Mutual Information
General form

For any collection of random variables X1, . . . ,XN and Y :

I(X1, . . . , XN; Y) = I(X1; Y) + I(X2, . . . , XN; Y | X1)
                     = I(X1; Y) + I(X2; Y | X1) + I(X3, . . . , XN; Y | X1, X2)
                     = . . .
                     = ∑_{i=1}^{N} I(Xi; Y | X1, . . . , Xi−1)
                     = ∑_{i=1}^{N} I(Y; Xi | X1, . . . , Xi−1).

9 / 37
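A small numerical sketch of the chain rule for N = 2, i.e.
I(X1, X2; Y) = I(X1; Y) + I(X2; Y | X1) (assuming NumPy; the joint distribution is
randomly generated purely for illustration):

import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(pXY):
    """Mutual information in bits from a 2-D joint distribution p(x, y)."""
    pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)
    return H(pX) + H(pY) - H(pXY.ravel())

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))                    # p[x1, x2, y], normalised next
p /= p.sum()

lhs = I(p.reshape(4, 2))                     # treat (X1, X2) as a single variable

I_X1_Y = I(p.sum(axis=1))                    # marginalise out X2 to get p(x1, y)
pX1 = p.sum(axis=(1, 2))
I_X2_Y_given_X1 = sum(pX1[x1] * I(p[x1] / pX1[x1]) for x1 in range(2))
rhs = I_X1_Y + I_X2_Y_given_X1

print(lhs, rhs)                              # equal up to floating-point error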

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen’s Inequality

4 Gibbs’ Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

10 / 37

Convex Functions:
Introduction

[Figure (MacKay, 2003): for xλ = λx1 + (1−λ)x2 with 0 ≤ λ ≤ 1, the chord value
λf(x1) + (1−λ)f(x2) lies above f(xλ).]

A function is convex ^ if every chord of the function lies above the function.

11 / 37

Convex and Concave Functions
Definitions

Definition
A function f (x) is convex ^ over (a, b) if for all x1, x2 ∈ (a, b) and
0 ≤ λ ≤ 1:

f (λx1 + (1− λ)x2) ≤ λf (x1) + (1− λ)f (x2)
We say f is strictly convex ^ if, for all x1 ≠ x2 in (a, b), equality holds only
for λ = 0 and λ = 1.

Similarly, a function f is concave _ if −f is convex ^, i.e. if every chord of
the function lies below the function.

12 / 37
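A quick numerical sketch of the chord condition in the definition above, for the
convex function f(x) = x^2 (assuming NumPy; the endpoints are arbitrary):

import numpy as np

f = np.square                                # a convex function
x1, x2 = -3.0, 5.0
lam = np.linspace(0.0, 1.0, 101)

lhs = f(lam * x1 + (1 - lam) * x2)           # f evaluated along the segment
rhs = lam * f(x1) + (1 - lam) * f(x2)        # the chord joining (x1, f(x1)) and (x2, f(x2))

assert np.all(lhs <= rhs + 1e-12)            # the chord lies above the function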

Examples of Convex and Concave Functions

[Figure: plots of x^2, e^x and x log x (convex ^), and of −x^2, log x and √x (concave _).]

13 / 37

Verifying Convexity

Theorem (Cover & Thomas, Th 2.6.1)
If a function f has a second derivative that is non-negative (positive) over
an interval, the function is convex ^ (strictly convex ^) over that interval.

This allows us to verify convexity or concavity.

Examples:

x^2:  d/dx ( d/dx (x^2) ) = d/dx (2x) = 2 > 0, so x^2 is convex ^.

e^x:  d/dx ( d/dx (e^x) ) = d/dx (e^x) = e^x > 0, so e^x is convex ^.

√x, x > 0:  d/dx ( d/dx (√x) ) = (1/2) d/dx (1/√x) = −(1/4) x^(−3/2) < 0, so √x is concave _.

14 / 37
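The same second-derivative test can be run symbolically; a sketch assuming SymPy is
available:

import sympy as sp

x = sp.symbols('x', positive=True)

for f in (x**2, sp.exp(x), sp.sqrt(x)):
    print(f, '->', sp.simplify(sp.diff(f, x, 2)))   # second derivative of f
# x**2    -> 2                 (non-negative: convex)
# exp(x)  -> exp(x)            (non-negative: convex)
# sqrt(x) -> -1/(4*x**(3/2))   (non-positive for x > 0: concave)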

Convexity, Concavity and Optimization

If f(x) is concave _ and there exists a point at which df/dx = 0, then f(x) has a
maximum at that point.

Note: the converse does not hold: if a concave _ f (x) is maximized at
some x , it is not necessarily true that the derivative is zero there.

f(x) = −|x| is maximized at x = 0, where its derivative is undefined.

f(p) = log p with 0 < p ≤ 1 is maximized at p = 1, where df/dp = 1 ≠ 0 (natural logarithm).

Similarly for minimisation of convex functions

15 / 37

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen’s Inequality

4 Gibbs’ Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

16 / 37

Jensen’s Inequality for Convex Functions

Theorem: Jensen’s Inequality
If f is a convex ^ function and X is a random variable then:

f (E[X ]) ≤ E[f (X )].

Moreover, if f is strictly convex ^, equality implies that X = E[X] with
probability 1, i.e. X is a constant.

In other words, for a probability vector p,

f( ∑_{i=1}^{N} p_i x_i ) ≤ ∑_{i=1}^{N} p_i f(x_i).

Similarly for a concave _ function: E[f (X )] ≤ f (E[X ]) .

17 / 37
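A minimal numerical sketch of Jensen's inequality for the convex function f(x) = x^2
(assuming NumPy; the support and probabilities are an arbitrary illustration):

import numpy as np

x = np.array([-2.0, 0.5, 1.0, 4.0])          # support of X
p = np.array([0.1, 0.4, 0.3, 0.2])           # probability vector (sums to 1)

f = np.square
f_of_mean = f(np.sum(p * x))                 # f(E[X])
mean_of_f = np.sum(p * f(x))                 # E[f(X)]

print(f_of_mean, mean_of_f)                  # 1.21 vs 4.0 for this example
assert f_of_mean <= mean_of_f                # f(E[X]) <= E[f(X)]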

Jensen’s Inequality for Convex Functions
Proof by Induction

(1) K = 2: For a two-state random variable X ∈ {x1, x2} with p = (p1, p2) = (p1, 1 − p1)
and 0 ≤ p1 ≤ 1, we simply follow the definition of convexity:

E[f(X)] = p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2) = f(E[X])

18 / 37

Jensen’s Inequality for Convex Functions
Proof by Induction — Cont’d

(2) (K − 1) → K: Assuming the theorem is true for distributions with K − 1 states,
and writing p'_i = p_i/(1 − p_K) for i = 1, . . . , K − 1:

∑_{i=1}^{K} p_i f(x_i) = p_K f(x_K) + (1 − p_K) ∑_{i=1}^{K−1} p'_i f(x_i)
                       ≥ p_K f(x_K) + (1 − p_K) f( ∑_{i=1}^{K−1} p'_i x_i )     induction hypothesis
                       ≥ f( p_K x_K + (1 − p_K) ∑_{i=1}^{K−1} p'_i x_i )        definition of convexity
                       = f( ∑_{i=1}^{K} p_i x_i )

Hence ∑_{i=1}^{K} p_i f(x_i) ≥ f( ∑_{i=1}^{K} p_i x_i ), i.e. E[f(X)] ≥ f(E[X]).  Equality case?

19 / 37

Jensen’s Inequality Example: The AM-GM Inequality

Recall that for a concave _ function: E[f (X )] ≤ f (E[X ]).
Consider X ∈ {x1, . . . , xN} with xi ≥ 0, the uniform probability distribution
p = (1/N, . . . , 1/N), and the strictly concave _ function f(x) = log x:

(1/N) ∑_{i=1}^{N} log x_i ≤ log( (1/N) ∑_{i=1}^{N} x_i )

log ( ∏_{i=1}^{N} x_i )^{1/N} ≤ log( (1/N) ∑_{i=1}^{N} x_i )

( ∏_{i=1}^{N} x_i )^{1/N} ≤ (1/N) ∑_{i=1}^{N} x_i

i.e. the geometric mean (x1 x2 · · · xN)^{1/N} is at most the arithmetic mean (x1 + x2 + · · · + xN)/N.

20 / 37
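A short numerical sketch of the AM-GM consequence derived above (assuming NumPy; the
positive numbers are randomly generated for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 10.0, size=8)           # arbitrary positive numbers

geometric_mean = np.prod(x) ** (1.0 / len(x))
arithmetic_mean = np.mean(x)

print(geometric_mean, arithmetic_mean)
assert geometric_mean <= arithmetic_mean     # geometric mean never exceeds arithmetic mean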

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen’s Inequality

4 Gibbs’ Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

21 / 37

Gibbs’ Inequality

Theorem
The relative entropy (or KL divergence) between two distributions p(X )
and q(X ) with X ∈ X is non-negative:

DKL(p‖q) ≥ 0

with equality if and only if p(x) = q(x) for all x.

22 / 37

Gibbs’ Inequality
Proof (1 of 2)

Recall that DKL(p‖q) = ∑_{x∈X} p(x) log (p(x)/q(x)) = E_{p(X)}[ log (p(X)/q(X)) ].

Let A = {x : p(x) > 0}. Then:

−DKL(p‖q) = ∑_{x∈A} p(x) log (q(x)/p(x))
          ≤ log ∑_{x∈A} p(x) (q(x)/p(x))      Jensen's inequality
          = log ∑_{x∈A} q(x)
          ≤ log ∑_{x∈X} q(x)
          = log 1
          = 0

23 / 37

Gibbs’ Inequality
Proof (2 of 2)

Since log u is strictly concave, we have equality in Jensen's inequality only if
q(x)/p(x) = c for all x ∈ A. Then:

∑_{x∈A} q(x) = c ∑_{x∈A} p(x) = c

Also, the last inequality on the previous slide becomes an equality only if:

∑_{x∈A} q(x) = ∑_{x∈X} q(x).

Therefore c = 1, and DKL(p‖q) = 0 ⇔ p(x) = q(x) for all x.

Alternative proof: Use the fact that ln x ≤ x − 1.

24 / 37
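A small numerical sketch of Gibbs' inequality (assuming NumPy; the two distributions
are arbitrary illustrations):

import numpy as np

def D_KL(p, q):
    """Relative entropy in bits; assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(D_KL(p, q))    # 0.25 bits > 0, since p differs from q
print(D_KL(p, p))    # exactly 0, the equality case p = q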

Non-Negativity of Mutual Information

Corollary
For any two random variables X ,Y :

I(X ;Y ) ≥ 0,

with equality if and only if X and Y are statistically independent.

Proof: We simply use the definition of mutual information and Gibbs’
inequality:

I(X ;Y ) = DKL(p(X ,Y )‖p(X )p(Y )) ≥ 0,
with equality if and only if p(X ,Y ) = p(X )p(Y ), i.e. X and Y are
independent.

25 / 37

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen’s Inequality

4 Gibbs’ Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

26 / 37

Conditioning Reduces Entropy
Information Cannot Hurt — Proof

Theorem
For any two random variables X ,Y ,

H(X |Y ) ≤ H(X ),

with equality if and only if X and Y are independent.

Proof: We simply use the non-negativity of mutual information:

I(X ;Y ) ≥ 0
H(X )− H(X |Y ) ≥ 0

H(X |Y ) ≤ H(X )
with equality if and only if p(X ,Y ) = p(X )p(Y ), i.e X and Y are
independent.

Data are helpful: they do not increase uncertainty on average.

27 / 37
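A brief numerical sketch that H(X|Y) ≤ H(X) (assuming NumPy; the joint distribution
is randomly generated for illustration):

import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
pXY = rng.random((3, 4))
pXY /= pXY.sum()                             # joint p(x, y)
pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)

H_X = H(pX)
H_X_given_Y = sum(pY[y] * H(pXY[:, y] / pY[y]) for y in range(pXY.shape[1]))

print(H_X_given_Y, H_X)
assert H_X_given_Y <= H_X + 1e-12            # conditioning reduces entropy on average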

Conditioning Reduces Entropy
Information Cannot Hurt — Example (from Cover & Thomas, 2006)

Let X ,Y have the following joint distribution:

p(X,Y)      X = 1   X = 2
Y = 1         0      3/4
Y = 2        1/8     1/8

p(X ) = (1/8, 7/8)

p(Y ) = (3/4, 1/4)

p(X |Y = 1) = (0, 1)
p(X |Y = 2) = (1/2, 1/2)

H(X ) ≈ 0.544 bits H(X |Y = 1) = 0 bits H(X |Y = 2) = 1 bit

We see that in this case H(X |Y = 1) < H(X ), H(X |Y = 2) > H(X ).

However,

H(X|Y) = ∑_{y∈{1,2}} p(y) H(X|Y = y) = 1/4 = 0.25 bits < H(X).

H(X|Y = yk) may be greater than H(X), but the average H(X|Y) is always less than or
equal to H(X). Information cannot hurt on average.

28 / 37

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen's Inequality

4 Gibbs' Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

29 / 37

Markov Chain

[Figure: the chain X → Y → Z]

Definition
Random variables X, Y, Z are said to form a Markov chain in that order (denoted by
X → Y → Z) if their joint probability distribution can be written as:

p(X,Y,Z) = p(X) p(Y|X) p(Z|Y)

Consequences:

X → Y → Z if and only if X and Z are conditionally independent given Y.

X → Y → Z implies that Z → Y → X.

If Z = f(Y), then X → Y → Z.

30 / 37

Data-Processing Inequality
Definition

Theorem
If X → Y → Z, then:

I(X;Y) ≥ I(X;Z)

X is the state of the world, Y is the data gathered and Z is the processed data.

No "clever" manipulation of the data can improve the inferences that can be made
from the data.

No processing of Y, deterministic or random, can increase the information that Y
contains about X.

31 / 37

Data-Processing Inequality
Proof

Recall that the chain rule for mutual information states that:

I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)

Therefore:

I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)      with I(X;Z|Y) = 0 by the Markov chain assumption
           I(X;Y) = I(X;Z) + I(X;Y|Z)      but I(X;Y|Z) ≥ 0
           I(X;Y) ≥ I(X;Z)

32 / 37

Data-Processing Inequality
Functions of the Data

Corollary
In particular, if Z = g(Y) we have that:

I(X;Y) ≥ I(X;g(Y))

Proof: X → Y → g(Y) forms a Markov chain.

Functions of the data Y cannot increase the information about X.

33 / 37

Data-Processing Inequality
Observation of a "Downstream" Variable

Corollary
If X → Y → Z then I(X;Y|Z) ≤ I(X;Y).

Proof: We use again the chain rule for mutual information:

I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)

Therefore:

I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)      with I(X;Z|Y) = 0 by the Markov chain assumption
         I(X;Y|Z) = I(X;Y) − I(X;Z)        but I(X;Z) ≥ 0
         I(X;Y|Z) ≤ I(X;Y)

The dependence between X and Y cannot be increased by the observation of a
"downstream" variable.

34 / 37

1 Chain Rule for Mutual Information

2 Convex Functions

3 Jensen's Inequality

4 Gibbs' Inequality

5 Information Cannot Hurt

6 Data Processing Inequality

7 Wrapping Up

35 / 37

Summary & Conclusions

Chain rule for mutual information

Convex functions

Jensen's inequality, Gibbs' inequality

Important inequalities regarding information, inference and data processing

Reading: MacKay §2.6 to §2.10, Cover & Thomas §2.5 to §2.8

36 / 37

Next time

Law of large numbers

Markov's inequality

Chebyshev's inequality

37 / 37
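To close, a minimal numerical sketch of the data-processing inequality for a chain
X → Y → Z with Z = g(Y) (assuming NumPy; the joint distribution and the merging map
g below are arbitrary illustrations):

import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(pXY):
    """Mutual information in bits from a 2-D joint distribution."""
    pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)
    return H(pX) + H(pY) - H(pXY.ravel())

rng = np.random.default_rng(3)
pXY = rng.random((3, 4))
pXY /= pXY.sum()                             # joint p(x, y)

g = np.array([0, 0, 1, 1])                   # Z = g(Y): merge Y's four outcomes into two groups
pXZ = np.zeros((3, 2))
for y, z in enumerate(g):
    pXZ[:, z] += pXY[:, y]                   # p(x, z) = sum of p(x, y) over y with g(y) = z

print(I(pXY), I(pXZ))
assert I(pXZ) <= I(pXY) + 1e-12              # I(X;Y) >= I(X;g(Y))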