COMP2610 / COMP6261 – Information Theory Lecture 8: Some Fundamental Inequalities
Robert C. Williamson
Research School of Computer Science
14 August 2018
Last time:
Decomposability of entropy
Relative entropy (KL divergence)
Mutual information
Relative entropy (KL divergence):
D_KL(p∥q) = ∑_x p(x) log (p(x) / q(x))
Mutual information:
I(X;Y) = D_KL(p(X,Y)∥p(X)p(Y)) = H(X) + H(Y) − H(X,Y)
= H(X) − H(X|Y): the average reduction in uncertainty in X when Y is known
I(X;Y) = 0 when X, Y are statistically independent
Conditional mutual information of X , Y given Z :
I(X;Y|Z) = H(X|Z) − H(X|Y,Z)
This time:
Mutual information chain rule
Jensen's inequality
"Information cannot hurt"
Data processing inequality
1 Chain Rule for Mutual Information
2 Convex Functions
3 Jensen’s Inequality
4 Gibbs’ Inequality
5 Information Cannot Hurt
6 Data Processing Inequality
7 Wrapping Up
Recall: Joint Mutual Information
Recall the mutual information between X and Y :
I(X;Y) = H(X) + H(Y) − H(X,Y) = I(Y;X).
We can also compute the mutual information between X_1, …, X_N and Y_1, …, Y_M:
I(X_1, …, X_N; Y_1, …, Y_M) = H(X_1, …, X_N) + H(Y_1, …, Y_M) − H(X_1, …, X_N, Y_1, …, Y_M)
= I(Y_1, …, Y_M; X_1, …, X_N).
Note that I(X,Y;Z) ≠ I(X;Y,Z) in general: the former is the reduction in uncertainty of X and Y given Z, the latter the reduction in uncertainty of X given Y and Z.
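As an aside (not part of the original slides), here is a tiny numerical check with an arbitrarily chosen joint distribution over three binary variables, illustrating that I(X,Y;Z) and I(X;Y,Z) generally take different values; all names in the sketch are illustrative.

```python
import itertools
import math

def H(dist):
    """Entropy in bits of a {outcome: probability} dictionary."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, keep):
    """Marginalise the joint onto the coordinate indices listed in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

# An arbitrary joint p(x, y, z) over binary variables (weights chosen by hand).
weights = dict(zip(itertools.product([0, 1], repeat=3), [1, 2, 3, 4, 4, 3, 2, 5]))
total = sum(weights.values())
p = {s: w / total for s, w in weights.items()}

# I(X,Y; Z) = H(X,Y) + H(Z) - H(X,Y,Z)
I_xy_z = H(marginal(p, [0, 1])) + H(marginal(p, [2])) - H(p)
# I(X; Y,Z) = H(X) + H(Y,Z) - H(X,Y,Z)
I_x_yz = H(marginal(p, [0])) + H(marginal(p, [1, 2])) - H(p)

print(I_xy_z, I_x_yz)  # two different numbers for this joint
```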
Chain Rule for Mutual Information
Let X, Y, Z be random variables and recall that:
p(Z,Y) = p(Z|Y)p(Y)
H(Z,Y) = H(Z|Y) + H(Y)
Then:
I(X;Y,Z) = I(Y,Z;X)   (symmetry)
= H(Z,Y) − H(Z,Y|X)   (definition of mutual information)
= H(Z|Y) + H(Y) − H(Z|X,Y) − H(Y|X)   (entropy chain rule)
= [H(Y) − H(Y|X)] + [H(Z|Y) − H(Z|X,Y)]
= I(Y;X) + I(Z;X|Y)
Hence
I(X;Y,Z) = I(X;Y) + I(X;Z|Y)   (definitions of mutual information and conditional mutual information)
Similarly, by symmetry:
I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
Chain Rule for Mutual Information: General Form
For any collection of random variables X_1, …, X_N and Y:
I(X_1, …, X_N; Y) = I(X_1; Y) + I(X_2, …, X_N; Y | X_1)
= I(X_1; Y) + I(X_2; Y | X_1) + I(X_3, …, X_N; Y | X_1, X_2)
= ∑_{i=1}^{N} I(X_i; Y | X_1, …, X_{i−1})
= ∑_{i=1}^{N} I(Y; X_i | X_1, …, X_{i−1}).
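To make the chain rule concrete, here is a minimal numerical sketch (not from the slides): it builds a random joint distribution p(X_1, X_2, Y) over binary variables and checks that I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1). The helper names are illustrative.

```python
import itertools
import math
import random

random.seed(0)

def H(dist):
    """Entropy in bits of a {outcome: probability} dictionary."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, keep):
    """Marginalise the joint onto the coordinate indices listed in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

# A random joint p(x1, x2, y) over three binary variables.
states = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in states]
p = {s: w / sum(weights) for s, w in zip(states, weights)}

# Left-hand side: I(X1,X2; Y) = H(X1,X2) + H(Y) - H(X1,X2,Y)
lhs = H(marginal(p, [0, 1])) + H(marginal(p, [2])) - H(p)

# Right-hand side: I(X1; Y) + I(X2; Y | X1)
I_x1_y = H(marginal(p, [0])) + H(marginal(p, [2])) - H(marginal(p, [0, 2]))
I_x2_y_given_x1 = (H(marginal(p, [0, 1])) + H(marginal(p, [0, 2]))
                   - H(p) - H(marginal(p, [0])))
rhs = I_x1_y + I_x2_y_given_x1

print(lhs, rhs)                # the two sides agree
assert abs(lhs - rhs) < 1e-10
```

The identity I(X_2; Y | X_1) = H(X_1, X_2) + H(X_1, Y) − H(X_1, X_2, Y) − H(X_1) used above follows directly from the definition of conditional mutual information.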
Convex Functions: Introduction
[Figure (from MacKay, 2003): chords of a convex ⌣ function lie above the function, for 0 ≤ λ ≤ 1.]
A function is convex ⌣ if every chord of the function lies above the function
Convex and Concave Functions Definitions
Definition
A function f(x) is convex ⌣ over (a, b) if for all x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1:
f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2).
We say f is strictly convex ⌣ if for all x_1 ≠ x_2 ∈ (a, b) equality holds only for λ = 0 and λ = 1.
Similarly, a function f is concave ⌢ if −f is convex ⌣, i.e. if every chord of the function lies below the function.
Examples of Convex and Concave Functions
[Figure: plots of example convex ⌣ and concave ⌢ functions.]
Verifying Convexity
Theorem (Cover & Thomas, Th 2.6.1)
If a function f has a second derivative that is non-negative (positive) over an interval, the function is convex ⌣(strictly convex ⌣) over that interval.
This allows us to verify convexity or concavity. Examples:
x²: d/dx (d/dx x²) = d/dx (2x) = 2 > 0, so x² is strictly convex ⌣.
eˣ: d/dx (d/dx eˣ) = d/dx eˣ = eˣ > 0, so eˣ is strictly convex ⌣.
√x, x > 0: d/dx (d/dx √x) = (1/2) d/dx (1/√x) = −(1/4)(1/√x³) < 0, so √x is concave ⌢.
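As a sanity check (not from the slides), the sketch below numerically tests the chord definition on these same functions: the chord gap is non-negative everywhere for the convex ones and non-positive for the concave one. The sample points and tolerance are arbitrary choices.

```python
import math
import random

random.seed(1)

def chord_gap(f, x1, x2, lam):
    """lam*f(x1) + (1-lam)*f(x2) - f(lam*x1 + (1-lam)*x2): >= 0 iff the chord lies above f."""
    return lam * f(x1) + (1 - lam) * f(x2) - f(lam * x1 + (1 - lam) * x2)

for name, f, convex in [("x^2", lambda x: x * x, True),
                        ("e^x", math.exp, True),
                        ("sqrt(x)", math.sqrt, False)]:
    gaps = [chord_gap(f, random.uniform(0.1, 5.0), random.uniform(0.1, 5.0), random.random())
            for _ in range(10_000)]
    ok = all(g >= -1e-12 for g in gaps) if convex else all(g <= 1e-12 for g in gaps)
    print(name, "convex" if convex else "concave", ok)
```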
Convexity, Concavity and Optimization
If f(x) is concave ⌢ and there exists a point at which df/dx = 0, then f(x) has a maximum at that point.
Note: the converse does not hold: if a concave ⌢ f(x) is maximized at some x, it is not necessarily true that the derivative is zero there.
f(x) = −|x| is maximized at x = 0, where its derivative is undefined.
f(p) = log p with 0 ≤ p ≤ 1 is maximized at p = 1, where df/dp = 1 ≠ 0.
Similarly for minimisation of convex functions
Jensen’s Inequality for Convex Functions
Theorem: Jensen’s Inequality
If f is a convex ⌣ function and X is a random variable then: f(E[X]) ≤ E[f(X)].
Moreover, if f is strictly convex ⌣, equality implies that X = E[X] with probability 1, i.e. X is a constant.
In other words, for a probability vector p,
f(∑_i p_i x_i) ≤ ∑_i p_i f(x_i).
Similarly for a concave ⌢ function: E[f (X )] ≤ f (E[X ]) .
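A quick numerical illustration (an assumed example, not from the slides): for the convex ⌣ function f(x) = x² and an arbitrary finite distribution, f(E[X]) ≤ E[f(X)].

```python
import random

random.seed(2)

# An arbitrary finite random variable: outcomes with a random probability vector.
xs = [random.uniform(-3, 3) for _ in range(6)]
ws = [random.random() for _ in xs]
ps = [w / sum(ws) for w in ws]

f = lambda x: x * x  # a convex function

E_X = sum(p * x for p, x in zip(ps, xs))
E_fX = sum(p * f(x) for p, x in zip(ps, xs))

print(f(E_X), "<=", E_fX)      # Jensen: f(E[X]) <= E[f(X)]
assert f(E_X) <= E_fX + 1e-12
```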
Jensen's Inequality for Convex Functions: Proof by Induction
(1) K = 2:
Two-state random variable X ∈ {x_1, x_2}.
With p = (p_1, p_2) = (p_1, 1 − p_1), 0 ≤ p_1 ≤ 1,
we simply follow the definition of convexity:
p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),
i.e. E[f(X)] ≥ f(E[X]).
Jensen's Inequality for Convex Functions: Proof by Induction (Cont'd)
(2) (K − 1) → K: Assuming the theorem is true for distributions with K − 1 states, and writing p_i′ = p_i/(1 − p_K) for i = 1, …, K − 1:
∑_{i=1}^{K} p_i f(x_i) = p_K f(x_K) + (1 − p_K) ∑_{i=1}^{K−1} p_i′ f(x_i)
≥ p_K f(x_K) + (1 − p_K) f(∑_{i=1}^{K−1} p_i′ x_i)   (induction hypothesis)
≥ f(p_K x_K + (1 − p_K) ∑_{i=1}^{K−1} p_i′ x_i)   (definition of convexity)
= f(∑_{i=1}^{K} p_i x_i)
Hence ∑_{i=1}^{K} p_i f(x_i) ≥ f(∑_{i=1}^{K} p_i x_i), i.e. E[f(X)] ≥ f(E[X]).
Equality case?
Jensen's Inequality Example: The AM-GM Inequality
Recall that for a concave ⌢ function: E[f(X)] ≤ f(E[X]).
Consider X ∈ {x_1, …, x_N} with all x_i ≥ 0, uniform probability distribution p = (1/N, …, 1/N), and the strictly concave ⌢ function f(x) = log x:
E[log X] ≤ log E[X]
(1/N) ∑_{i=1}^{N} log x_i ≤ log((1/N) ∑_{i=1}^{N} x_i)
log (x_1 x_2 ⋯ x_N)^{1/N} ≤ log((1/N) ∑_{i=1}^{N} x_i)
(x_1 x_2 ⋯ x_N)^{1/N} ≤ (1/N) ∑_{i=1}^{N} x_i
i.e. the geometric mean (x_1 x_2 ⋯ x_N)^{1/N} is at most the arithmetic mean (x_1 + x_2 + ⋯ + x_N)/N.
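The inequality is easy to check numerically; the sketch below (illustrative, using arbitrary positive numbers) compares the geometric and arithmetic means.

```python
import math
import random

random.seed(3)

xs = [random.uniform(0.1, 10.0) for _ in range(8)]  # arbitrary positive numbers

geometric_mean = math.prod(xs) ** (1.0 / len(xs))
arithmetic_mean = sum(xs) / len(xs)

print(geometric_mean, "<=", arithmetic_mean)
assert geometric_mean <= arithmetic_mean
```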
Gibbs’ Inequality
The relative entropy (or KL divergence) between two distributions p(X) and q(X) of a random variable X with alphabet 𝒳 is non-negative:
DKL(p∥q) ≥ 0 with equality if and only if p(x) = q(x) for all x.
Gibbs' Inequality: Proof (1 of 2)
Recall that D_KL(p∥q) = ∑_x p(x) log (p(x)/q(x)) = E_{p(X)}[log (p(X)/q(X))].
Let A = {x : p(x) > 0}. Then:
−D_KL(p∥q) = ∑_{x∈A} p(x) log (q(x)/p(x))
≤ log ∑_{x∈A} p(x) (q(x)/p(x))   (Jensen's inequality)
= log ∑_{x∈A} q(x)
≤ log ∑_{x∈𝒳} q(x)
= log 1 = 0
Hence D_KL(p∥q) ≥ 0.
Gibbs' Inequality: Proof (2 of 2)
Since log u is strictly concave ⌢, the Jensen step holds with equality only if q(x)/p(x) = c (a constant) for all x ∈ A. Then:
∑_{x∈A} q(x) = c ∑_{x∈A} p(x) = c.
Also, the last inequality on the previous slide becomes an equality only if
∑_{x∈A} q(x) = ∑_{x∈𝒳} q(x) = 1.
Therefore c = 1, and D_KL(p∥q) = 0 ⇔ p(x) = q(x) for all x.
Alternative proof: Use the fact that log x ≤ x − 1.
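A minimal numerical sketch (not part of the slides), assuming two arbitrary distributions on the same finite alphabet: D_KL(p∥q) is non-negative, and it is exactly zero when the two distributions coincide.

```python
import math
import random

random.seed(4)

def kl(p, q):
    """D_KL(p || q) in bits, for distributions given as lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(k):
    """A random probability vector of length k (illustrative)."""
    w = [random.random() for _ in range(k)]
    return [wi / sum(w) for wi in w]

p = random_dist(5)
q = random_dist(5)

print(kl(p, q))   # strictly positive, since p != q almost surely
print(kl(p, p))   # exactly 0 when p = q
```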
Non-Negativity of Mutual Information
For any two random variables X, Y:
I(X;Y) ≥ 0,
with equality if and only if X and Y are statistically independent.
Proof: We simply use the definition of mutual information and Gibbs' inequality:
I(X;Y) = DKL(p(X,Y)∥p(X)p(Y)) ≥ 0,
with equality if and only if p(X,Y) = p(X)p(Y), i.e. X and Y are independent.
Conditioning Reduces Entropy: Information Cannot Hurt (Proof)
For any two random variables X , Y ,
H(X|Y) ≤ H(X),
with equality if and only if X and Y are independent.
Proof: We simply use the non-negativity of mutual information:
I(X;Y) ≥ 0 ⇒ H(X) − H(X|Y) ≥ 0 ⇒ H(X|Y) ≤ H(X),
with equality if and only if p(X,Y) = p(X)p(Y), i.e. X and Y are independent.
Data are helpful: they do not increase uncertainty on average.
Conditioning Reduces Entropy
Information Cannot Hurt — Example (from Cover & Thomas, 2006)
Let X , Y have the following joint distribution:
p(X,Y)    X=1    X=2
Y=1        0     3/4
Y=2       1/8    1/8

Marginals: p(X) = (1/8, 7/8), p(Y) = (3/4, 1/4)
H(X) ≈ 0.544 bits
p(X|Y=1) = (0, 1), so H(X|Y=1) = 0 bits
p(X|Y=2) = (1/2, 1/2), so H(X|Y=2) = 1 bit
We see that in this case H(X|Y=1) < H(X) but H(X|Y=2) > H(X).
However, H(X|Y) = ∑_{y∈{1,2}} p(y) H(X|Y=y) = (3/4)·0 + (1/4)·1 = 1/4 = 0.25 bits < H(X).
An individual H(X|Y=y) may be greater than H(X), but the average H(X|Y) is always less than or equal to H(X).
Information cannot hurt on average
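These numbers can be reproduced with a short script; this is a sketch hard-coding the joint table above.

```python
import math

def H(dist):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Joint p(x, y) from the table: columns x = 1, 2; rows y = 1, 2.
p = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

p_x = [sum(v for (x, y), v in p.items() if x == k) for k in (1, 2)]   # (1/8, 7/8)
p_y = [sum(v for (x, y), v in p.items() if y == k) for k in (1, 2)]   # (3/4, 1/4)

H_X = H(p_x)                                                          # ~0.544 bits
H_X_given_y = {k: H([p[(1, k)] / p_y[k - 1], p[(2, k)] / p_y[k - 1]]) for k in (1, 2)}
H_X_given_Y = sum(p_y[k - 1] * H_X_given_y[k] for k in (1, 2))        # 0.25 bits

print(H_X, H_X_given_y, H_X_given_Y)
print(H_X_given_y[2] > H_X)   # conditioning on a particular outcome can increase entropy
print(H_X_given_Y <= H_X)     # but on average it cannot
```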
Markov Chain
Definition
Random variables X , Y , Z are said to form a Markov chain in that order (denoted by X → Y → Z ) if their joint probability distribution can be written as:
p(X,Y,Z) = p(X)p(Y|X)p(Z|Y)
Consequences:
X → Y → Z if and only if X and Z are conditionally independent given Y.
X → Y → Z implies that Z → Y → X.
If Z = f(Y), then X → Y → Z.
Data-Processing Inequality: Definition
If X → Y → Z, then I(X;Y) ≥ I(X;Z).
X is the state of the world, Y is the data gathered, and Z is the processed data.
No “clever” manipulation of the data can improve the inferences that can be made from the data
No processing of Y , deterministic or random, can increase the information that Y contains about X
Data-Processing Inequality: Proof
Recall that the chain rule for mutual information states that:
I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)
Therefore:
I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).
Since X → Y → Z, X and Z are conditionally independent given Y, so I(X;Z|Y) = 0 (Markov chain assumption). Hence
I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z), because I(X;Y|Z) ≥ 0.
Data-Processing Inequality: Functions of the Data
In particular, if Z = g(Y) we have that:
I(X;Y) ≥ I(X;g(Y))
Proof: X → Y → g(Y) forms a Markov chain.
Functions of the data Y cannot increase the information about X
Data-Processing Inequality: Observation of a "Downstream" Variable
If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).
Proof: We again use the chain rule for mutual information:
I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)
Therefore:
I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).
Since I(X;Z|Y) = 0 (Markov chain assumption),
I(X;Y|Z) = I(X;Y) − I(X;Z) ≤ I(X;Y), because I(X;Z) ≥ 0.
The dependence between X and Y cannot be increased by the observation of a “downstream” variable.
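To see both inequalities in action, the sketch below (illustrative, with an arbitrarily chosen Markov chain X → Y → Z over binary variables) computes the joint distribution exactly and checks that I(X;Y) ≥ I(X;Z) and I(X;Y|Z) ≤ I(X;Y). The conditional probability tables are made-up numbers.

```python
import itertools
import math

def H(dist):
    """Entropy in bits of a {outcome: probability} dictionary."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, keep):
    """Marginalise the joint onto the coordinate indices listed in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

# An arbitrary Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y).
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # noisy channel X -> Y
p_z_given_y = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.4, 1: 0.6}}  # further processing Y -> Z

p = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
     for x, y, z in itertools.product([0, 1], repeat=3)}

I_XY = H(marginal(p, [0])) + H(marginal(p, [1])) - H(marginal(p, [0, 1]))
I_XZ = H(marginal(p, [0])) + H(marginal(p, [2])) - H(marginal(p, [0, 2]))
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_XY_given_Z = (H(marginal(p, [0, 2])) + H(marginal(p, [1, 2]))
                - H(p) - H(marginal(p, [2])))

print(I_XY, I_XZ, I_XY_given_Z)
print(I_XY >= I_XZ)           # data-processing inequality
print(I_XY_given_Z <= I_XY)   # observing the downstream Z cannot increase dependence
```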
Summary & Conclusions
Chain rule for mutual information
Convex Functions
Jensen’s inequality, Gibbs’ inequality
Important inequalities regarding information, inference and data processing
Reading: MacKay §2.6 to §2.10, Cover & Thomas §2.5 to §2.8
Next time:
Law of large numbers
Markov's inequality
Chebyshev's inequality