COMP2610 / COMP6261 – Information Theory Lecture 8: Some Fundamental Inequalities
Robert C. Williamson
Research School of Computer Science
14 August 2018
Last time:
Decomposability of entropy
Relative entropy (KL divergence)
Mutual information
Relative entropy (KL divergence):
D_KL(p∥q) = ∑_x p(x) log (p(x) / q(x))
Mutual information:
I(X;Y) = D_KL(p(X,Y)∥p(X)p(Y)) = H(X) + H(Y) − H(X,Y)
= H(X) − H(X|Y): the average reduction in uncertainty in X when Y is known
I(X;Y) = 0 when X, Y are statistically independent
Conditional mutual information of X , Y given Z :
I(X;Y|Z) = H(X|Z) − H(X|Y,Z)
This time:
Mutual information chain rule
Jensen's inequality
"Information cannot hurt"
Data processing inequality
1 Chain Rule for Mutual Information
2 Convex Functions
3 Jensen’s Inequality
4 Gibbs’ Inequality
5 Information Cannot Hurt
6 Data Processing Inequality
7 Wrapping Up
Recall: Joint Mutual Information
Recall the mutual information between X and Y :
I(X;Y) = H(X) + H(Y) − H(X,Y) = I(Y;X).
We can also compute the mutual information between X_1, …, X_N and Y_1, …, Y_M:
I(X_1, …, X_N; Y_1, …, Y_M) = H(X_1, …, X_N) + H(Y_1, …, Y_M) − H(X_1, …, X_N, Y_1, …, Y_M)
= I(Y_1, …, Y_M; X_1, …, X_N).
Note that I(X,Y;Z) ≠ I(X;Y,Z) in general: the former is the reduction in uncertainty of X and Y given Z, the latter the reduction in uncertainty of X given Y and Z.
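As an aside (not part of the original slides), here is a tiny numerical check with an arbitrarily chosen joint distribution over three binary variables, illustrating that I(X,Y;Z) and I(X;Y,Z) generally take different values; all names in the sketch are illustrative.

```python
import itertools
import math

def H(dist):
    """Entropy in bits of a {outcome: probability} dictionary."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, keep):
    """Marginalise the joint onto the coordinate indices listed in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

# An arbitrary joint p(x, y, z) over binary variables (weights chosen by hand).
weights = dict(zip(itertools.product([0, 1], repeat=3), [1, 2, 3, 4, 4, 3, 2, 5]))
total = sum(weights.values())
p = {s: w / total for s, w in weights.items()}

# I(X,Y; Z) = H(X,Y) + H(Z) - H(X,Y,Z)
I_xy_z = H(marginal(p, [0, 1])) + H(marginal(p, [2])) - H(p)
# I(X; Y,Z) = H(X) + H(Y,Z) - H(X,Y,Z)
I_x_yz = H(marginal(p, [0])) + H(marginal(p, [1, 2])) - H(p)

print(I_xy_z, I_x_yz)  # two different numbers for this joint
```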
Chain Rule for Mutual Information
Let X, Y, Z be random variables and recall that:
p(Z,Y) = p(Z|Y)p(Y)
H(Z,Y) = H(Z|Y) + H(Y)
Then:
I(X;Y,Z) = I(Y,Z;X)   (symmetry)
= H(Z,Y) − H(Z,Y|X)   (definition of mutual information)
= H(Z|Y) + H(Y) − H(Z|X,Y) − H(Y|X)   (entropy chain rule)
= [H(Y) − H(Y|X)] + [H(Z|Y) − H(Z|X,Y)]
= I(Y;X) + I(Z;X|Y)
Hence
I(X;Y,Z) = I(X;Y) + I(X;Z|Y)   (definitions of mutual information and conditional mutual information)
Similarly, by symmetry:
I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
Chain Rule for Mutual Information: General Form
For any collection of random variables X_1, …, X_N and Y:
I(X_1, …, X_N; Y) = I(X_1; Y) + I(X_2, …, X_N; Y | X_1)
= I(X_1; Y) + I(X_2; Y | X_1) + I(X_3, …, X_N; Y | X_1, X_2)
= ∑_{i=1}^{N} I(X_i; Y | X_1, …, X_{i−1})
= ∑_{i=1}^{N} I(Y; X_i | X_1, …, X_{i−1}).
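To make the chain rule concrete, here is a minimal numerical sketch (not from the slides): it builds a random joint distribution p(X_1, X_2, Y) over binary variables and checks that I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1). The helper names are illustrative.

```python
import itertools
import math
import random

random.seed(0)

def H(dist):
    """Entropy in bits of a {outcome: probability} dictionary."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, keep):
    """Marginalise the joint onto the coordinate indices listed in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

# A random joint p(x1, x2, y) over three binary variables.
states = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in states]
p = {s: w / sum(weights) for s, w in zip(states, weights)}

# Left-hand side: I(X1,X2; Y) = H(X1,X2) + H(Y) - H(X1,X2,Y)
lhs = H(marginal(p, [0, 1])) + H(marginal(p, [2])) - H(p)

# Right-hand side: I(X1; Y) + I(X2; Y | X1)
I_x1_y = H(marginal(p, [0])) + H(marginal(p, [2])) - H(marginal(p, [0, 2]))
I_x2_y_given_x1 = (H(marginal(p, [0, 1])) + H(marginal(p, [0, 2]))
                   - H(p) - H(marginal(p, [0])))
rhs = I_x1_y + I_x2_y_given_x1

print(lhs, rhs)                # the two sides agree
assert abs(lhs - rhs) < 1e-10
```

The identity I(X_2; Y | X_1) = H(X_1, X_2) + H(X_1, Y) − H(X_1, X_2, Y) − H(X_1) used above follows directly from the definition of conditional mutual information.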
Convex Functions: Introduction
[Figure (from MacKay, 2003): chords of a convex ⌣ function lie above the function, for 0 ≤ λ ≤ 1.]
A function is convex ⌣ if every chord of the function lies above the function
Convex and Concave Functions Definitions
Definition
A function f(x) is convex ⌣ over (a, b) if for all x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1:
f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2).
We say f is strictly convex ⌣ if for all x_1 ≠ x_2 ∈ (a, b) equality holds only for λ = 0 and λ = 1.
Similarly, a function f is concave ⌢ if −f is convex ⌣, i.e. if every chord of the function lies below the function.
Examples of Convex and Concave Functions
[Figure: plots of example convex ⌣ and concave ⌢ functions.]
Verifying Convexity
Theorem (Cover & Thomas, Th 2.6.1)
If a function f has a second derivative that is non-negative (positive) over an interval, the function is convex ⌣(strictly convex ⌣) over that interval.
This allows us to verify convexity or concavity. Examples:
x²: d/dx (d/dx x²) = d/dx (2x) = 2 > 0, so x² is strictly convex ⌣.
eˣ: d/dx (d/dx eˣ) = d/dx eˣ = eˣ > 0, so eˣ is strictly convex ⌣.
√x, x > 0: d/dx (d/dx √x) = (1/2) d/dx (1/√x) = −(1/4)(1/√x³) < 0, so √x is concave ⌢.
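As a sanity check (not from the slides), the sketch below numerically tests the chord definition on these same functions: the chord gap is non-negative everywhere for the convex ones and non-positive for the concave one. The sample points and tolerance are arbitrary choices.

```python
import math
import random

random.seed(1)

def chord_gap(f, x1, x2, lam):
    """lam*f(x1) + (1-lam)*f(x2) - f(lam*x1 + (1-lam)*x2): >= 0 iff the chord lies above f."""
    return lam * f(x1) + (1 - lam) * f(x2) - f(lam * x1 + (1 - lam) * x2)

for name, f, convex in [("x^2", lambda x: x * x, True),
                        ("e^x", math.exp, True),
                        ("sqrt(x)", math.sqrt, False)]:
    gaps = [chord_gap(f, random.uniform(0.1, 5.0), random.uniform(0.1, 5.0), random.random())
            for _ in range(10_000)]
    ok = all(g >= -1e-12 for g in gaps) if convex else all(g <= 1e-12 for g in gaps)
    print(name, "convex" if convex else "concave", ok)
```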
Convexity, Concavity and Optimization
If f(x) is concave ⌢ and there exists a point at which df/dx = 0, then f(x) has a maximum at that point.
Note: the converse does not hold: if a concave ⌢ f(x) is maximized at some x, it is not necessarily true that the derivative is zero there.
f(x) = −|x| is maximized at x = 0, where its derivative is undefined.
f(p) = log p with 0 ≤ p ≤ 1 is maximized at p = 1, where df/dp = 1 ≠ 0.
Similarly for minimisation of convex functions
Jensen’s Inequality for Convex Functions
Theorem: Jensen’s Inequality
If f is a convex ⌣ function and X is a random variable then: f(E[X]) ≤ E[f(X)].
Moreover, if f is strictly convex ⌣, equality implies that X = E[X] with probability 1, i.e. X is a constant.
In other words, for a probability vector p,
f(∑_i p_i x_i) ≤ ∑_i p_i f(x_i).
Similarly for a concave ⌢ function: E[f (X )] ≤ f (E[X ]) .
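A quick numerical illustration (an assumed example, not from the slides): for the convex ⌣ function f(x) = x² and an arbitrary finite distribution, f(E[X]) ≤ E[f(X)].

```python
import random

random.seed(2)

# An arbitrary finite random variable: outcomes with a random probability vector.
xs = [random.uniform(-3, 3) for _ in range(6)]
ws = [random.random() for _ in xs]
ps = [w / sum(ws) for w in ws]

f = lambda x: x * x  # a convex function

E_X = sum(p * x for p, x in zip(ps, xs))
E_fX = sum(p * f(x) for p, x in zip(ps, xs))

print(f(E_X), "<=", E_fX)      # Jensen: f(E[X]) <= E[f(X)]
assert f(E_X) <= E_fX + 1e-12
```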
Jensen's Inequality for Convex Functions: Proof by Induction
(1) K = 2:
Two-state random variable X ∈ {x_1, x_2}.
With p = (p_1, p_2) = (p_1, 1 − p_1), 0 ≤ p_1 ≤ 1,
we simply follow the definition of convexity:
p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),
i.e. E[f(X)] ≥ f(E[X]).
Jensen's Inequality for Convex Functions: Proof by Induction (Cont'd)
(2) (K − 1) → K: Assuming the theorem is true for distributions with K − 1 states, and writing p_i′ = p_i/(1 − p_K) for i = 1, …, K − 1:
∑_{i=1}^{K} p_i f(x_i) = p_K f(x_K) + (1 − p_K) ∑_{i=1}^{K−1} p_i′ f(x_i)
≥ p_K f(x_K) + (1 − p_K) f(∑_{i=1}^{K−1} p_i′ x_i)   (induction hypothesis)
≥ f(p_K x_K + (1 − p_K) ∑_{i=1}^{K−1} p_i′ x_i)   (definition of convexity)
= f(∑_{i=1}^{K} p_i x_i)
Hence ∑_{i=1}^{K} p_i f(x_i) ≥ f(∑_{i=1}^{K} p_i x_i), i.e. E[f(X)] ≥ f(E[X]).
Equality case?
Jensen's Inequality Example: The AM-GM Inequality
Recall that for a concave ⌢ function: E[f(X)] ≤ f(E[X]).
Consider X ∈ {x_1, …, x_N} with all x_i ≥ 0, uniform probability distribution p = (1/N, …, 1/N), and the strictly concave ⌢ function f(x) = log x:
E[log X] ≤ log E[X]
(1/N) ∑_{i=1}^{N} log x_i ≤ log((1/N) ∑_{i=1}^{N} x_i)
log (x_1 x_2 ⋯ x_N)^{1/N} ≤ log((1/N) ∑_{i=1}^{N} x_i)
(x_1 x_2 ⋯ x_N)^{1/N} ≤ (1/N) ∑_{i=1}^{N} x_i
i.e. the geometric mean (x_1 x_2 ⋯ x_N)^{1/N} is at most the arithmetic mean (x_1 + x_2 + ⋯ + x_N)/N.
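The inequality is easy to check numerically; the sketch below (illustrative, using arbitrary positive numbers) compares the geometric and arithmetic means.

```python
import math
import random

random.seed(3)

xs = [random.uniform(0.1, 10.0) for _ in range(8)]  # arbitrary positive numbers

geometric_mean = math.prod(xs) ** (1.0 / len(xs))
arithmetic_mean = sum(xs) / len(xs)

print(geometric_mean, "<=", arithmetic_mean)
assert geometric_mean <= arithmetic_mean
```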
Gibbs’ Inequality
The relative entropy (or KL divergence) between two distributions p(X) and q(X) of a random variable X with alphabet 𝒳 is non-negative:
DKL(p∥q) ≥ 0 with equality if and only if p(x) = q(x) for all x.
Gibbs' Inequality: Proof (1 of 2)
Recall that D_KL(p∥q) = ∑_x p(x) log (p(x)/q(x)) = E_{p(X)}[log (p(X)/q(X))].
Let A = {x : p(x) > 0}. Then:
−D_KL(p∥q) = ∑_{x∈A} p(x) log (q(x)/p(x))
≤ log ∑_{x∈A} p(x) (q(x)/p(x))   (Jensen's inequality)
= log ∑_{x∈A} q(x)
≤ log ∑_{x∈𝒳} q(x)
= log 1 = 0
Hence D_KL(p∥q) ≥ 0.
Gibbs' Inequality: Proof (2 of 2)
Since log u is strictly concave ⌢, the Jensen step holds with equality only if q(x)/p(x) = c (a constant) for all x ∈ A. Then:
∑_{x∈A} q(x) = c ∑_{x∈A} p(x) = c.
Also, the last inequality on the previous slide becomes an equality only if
∑_{x∈A} q(x) = ∑_{x∈𝒳} q(x) = 1.
Therefore c = 1, and D_KL(p∥q) = 0 ⇔ p(x) = q(x) for all x.
Alternative proof: Use the fact that log x ≤ x − 1.
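A minimal numerical sketch (not part of the slides), assuming two arbitrary distributions on the same finite alphabet: D_KL(p∥q) is non-negative, and it is exactly zero when the two distributions coincide.

```python
import math
import random

random.seed(4)

def kl(p, q):
    """D_KL(p || q) in bits, for distributions given as lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(k):
    """A random probability vector of length k (illustrative)."""
    w = [random.random() for _ in range(k)]
    return [wi / sum(w) for wi in w]

p = random_dist(5)
q = random_dist(5)

print(kl(p, q))   # strictly positive, since p != q almost surely
print(kl(p, p))   # exactly 0 when p = q
```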
Non-Negativity of Mutual Information
For any two random variables X, Y:
I(X;Y) ≥ 0,
with equality if and only if X and Y are statistically independent.
Proof: We simply use the definition of mutual information and Gibbs' inequality:
I(X;Y) = DKL(p(X,Y)∥p(X)p(Y)) ≥ 0,
with equality if and only if p(X,Y) = p(X)p(Y), i.e. X and Y are independent.
Conditioning Reduces Entropy: Information Cannot Hurt (Proof)
For any two random variables X , Y ,
H(X|Y) ≤ H(X),
with equality if and only if X and Y are independent.
Proof: We simply use the non-negativity of mutual information:
I(X;Y) ≥ 0 ⇒ H(X) − H(X|Y) ≥ 0 ⇒ H(X|Y) ≤ H(X),
with equality if and only if p(X,Y) = p(X)p(Y), i.e. X and Y are independent.
Data are helpful: they do not increase uncertainty on average.
Conditioning Reduces Entropy
Information Cannot Hurt — Example (from Cover & Thomas, 2006)
Let X , Y have the following joint distribution:
p(X,Y)    X=1    X=2
Y=1        0     3/4
Y=2       1/8    1/8

Marginals: p(X) = (1/8, 7/8), p(Y) = (3/4, 1/4)
H(X) ≈ 0.544 bits
p(X|Y=1) = (0, 1), so H(X|Y=1) = 0 bits
p(X|Y=2) = (1/2, 1/2), so H(X|Y=2) = 1 bit
We see that in this case H(X|Y=1) < H(X) but H(X|Y=2) > H(X).
However, H(X|Y) = ∑_{y∈{1,2}} p(y) H(X|Y=y) = (3/4)·0 + (1/4)·1 = 1/4 = 0.25 bits < H(X).
An individual H(X|Y=y) may be greater than H(X), but the average H(X|Y) is always less than or equal to H(X).
Information cannot hurt on average
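These numbers can be reproduced with a short script; this is a sketch hard-coding the joint table above.

```python
import math

def H(dist):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Joint p(x, y) from the table: columns x = 1, 2; rows y = 1, 2.
p = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

p_x = [sum(v for (x, y), v in p.items() if x == k) for k in (1, 2)]   # (1/8, 7/8)
p_y = [sum(v for (x, y), v in p.items() if y == k) for k in (1, 2)]   # (3/4, 1/4)

H_X = H(p_x)                                                          # ~0.544 bits
H_X_given_y = {k: H([p[(1, k)] / p_y[k - 1], p[(2, k)] / p_y[k - 1]]) for k in (1, 2)}
H_X_given_Y = sum(p_y[k - 1] * H_X_given_y[k] for k in (1, 2))        # 0.25 bits

print(H_X, H_X_given_y, H_X_given_Y)
print(H_X_given_y[2] > H_X)   # conditioning on a particular outcome can increase entropy
print(H_X_given_Y <= H_X)     # but on average it cannot
```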
Markov Chain
Definition
Random variables X , Y , Z are said to form a Markov chain in that order (denoted by X → Y → Z ) if their joint probability distribution can be written as:
p(X,Y,Z) = p(X)p(Y|X)p(Z|Y)
Consequences:
X → Y → Z if and only if X and Z are conditionally independent given Y.
X → Y → Z implies that Z → Y → X.
If Z = f(Y), then X → Y → Z.
Data-Processing Inequality: Definition
If X → Y → Z, then I(X;Y) ≥ I(X;Z).
X is the state of the world, Y is the data gathered, and Z is the processed data.
No “clever” manipulation of the data can improve the inferences that can be made from the data
No processing of Y , deterministic or random, can increase the information that Y contains about X
Data-Processing Inequality: Proof
Recall that the chain rule for mutual information states that:
I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)
Therefore:
I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).
Since X → Y → Z, X and Z are conditionally independent given Y, so I(X;Z|Y) = 0 (Markov chain assumption). Hence
I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z), because I(X;Y|Z) ≥ 0.
Data-Processing Inequality: Functions of the Data
In particular, if Z = g(Y) we have that:
I(X;Y) ≥ I(X;g(Y))
Proof: X → Y → g(Y) forms a Markov chain.
Functions of the data Y cannot increase the information about X
Data-Processing Inequality: Observation of a "Downstream" Variable
If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).
Proof: We again use the chain rule for mutual information:
I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)
Therefore:
I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).
Since I(X;Z|Y) = 0 (Markov chain assumption),
I(X;Y|Z) = I(X;Y) − I(X;Z) ≤ I(X;Y), because I(X;Z) ≥ 0.
The dependence between X and Y cannot be increased by the observation of a “downstream” variable.
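To see both inequalities in action, the sketch below (illustrative, with an arbitrarily chosen Markov chain X → Y → Z over binary variables) computes the joint distribution exactly and checks that I(X;Y) ≥ I(X;Z) and I(X;Y|Z) ≤ I(X;Y). The conditional probability tables are made-up numbers.

```python
import itertools
import math

def H(dist):
    """Entropy in bits of a {outcome: probability} dictionary."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, keep):
    """Marginalise the joint onto the coordinate indices listed in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

# An arbitrary Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y).
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # noisy channel X -> Y
p_z_given_y = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.4, 1: 0.6}}  # further processing Y -> Z

p = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
     for x, y, z in itertools.product([0, 1], repeat=3)}

I_XY = H(marginal(p, [0])) + H(marginal(p, [1])) - H(marginal(p, [0, 1]))
I_XZ = H(marginal(p, [0])) + H(marginal(p, [2])) - H(marginal(p, [0, 2]))
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_XY_given_Z = (H(marginal(p, [0, 2])) + H(marginal(p, [1, 2]))
                - H(p) - H(marginal(p, [2])))

print(I_XY, I_XZ, I_XY_given_Z)
print(I_XY >= I_XZ)           # data-processing inequality
print(I_XY_given_Z <= I_XY)   # observing the downstream Z cannot increase dependence
```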
Summary & Conclusions
Chain rule for mutual information
Convex Functions
Jensen’s inequality, Gibbs’ inequality
Important inequalities regarding information, inference and data processing
Reading: MacKay §2.6 to §2.10, Cover & Thomas §2.5 to §2.8
Next time:
Law of large numbers
Markov's inequality
Chebyshev's inequality