
COMP2610 / COMP6261 – Information Theory Lecture 7: Relative Entropy and Mutual Information
Williamson
Research School of Computer Science
13 August 2018

Information content and entropy: definition and computation
Entropy and average code length
Entropy and minimum expected number of binary questions
Joint and conditional entropies, chain rule

Information Content: Review
Let X be a random variable with outcomes in X
Let p(x) denote the probability of the outcome x ∈ X
The (Shannon) information content of outcome x is
h(x) = log2( 1/p(x) )
As p(x) → 0, h(x) → +∞ (rare outcomes are more informative)

Entropy: Review
The entropy is the average information content of all outcomes:
H(X) = Σ_{x∈X} p(x) log2( 1/p(x) )
Entropy is minimised if p is peaked, and maximised if p is uniform:
0 ≤ H(X) ≤ log|X|
Entropy is related to minimal number of bits needed to describe a random variable
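These quantities are easy to verify numerically. Below is a minimal Python sketch (the function name entropy and the list-of-probabilities input format are my own choices, not part of the lecture):

```python
import math

def entropy(p):
    """Entropy in bits of a distribution given as a list of probabilities."""
    # Terms with p(x) = 0 contribute 0, by the convention 0 log(1/0) = 0.
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

# Uniform distribution over 4 outcomes: H = log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# Peaked distribution: H = 0 bits.
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0
```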

The decomposability property of entropy
Relative entropy and divergences
Mutual information

1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information Definition
Joint and Conditional Mutual Information
4 Wrapping up

Decomposability of Entropy Example 1 (Mackay, 2003)
Let X ∈ {1, 2, 3} be a r.v. created by the following process:
1 Flip a fair coin to determine whether X = 1
2 If X ≠ 1, flip another fair coin to determine whether X = 2 or X = 3
The probability distribution of X is given by:
p(X = 1) = 1/2,  p(X = 2) = 1/4,  p(X = 3) = 1/4

Decomposability of Entropy Example 1 (Mackay, 2003) — Cont’d
By definition,
H(X) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits.
But imagine learning the value of X gradually:
1 First we learn whether X = 1:
A binary variable with probabilities (1/2, 1/2)
􏰜 Hence H(1/2, 1/2) = log2 2 = 1 bit.
2 If X ̸= 1 we learn the value of the second coin flip:
Also a binary variable with probabilities (1/2, 1/2)
􏰜 Therefore H(1/2,1/2) = 1 bit.
However, the second revelation only happens half of the time:
H(X) = H(1/2, 1/2) + (1/2) H(1/2, 1/2) = 1.5 bits.

Decomposability of Entropy Generalization
For a r.v. with probability distribution p = (p1, . . . , p|X |):
H(p) = H(p1, 1 − p1) + (1 − p1) H( p2/(1 − p1), ..., p|X|/(1 − p1) )

H(p1, 1 − p1): entropy of the binary variable corresponding to "Is X = 1?"
1 − p1: probability of X ≠ 1
p2/(1 − p1), ..., p|X|/(1 − p1): conditional probabilities of X = 2, ..., |X| given X ≠ 1
H( p2/(1 − p1), ..., p|X|/(1 − p1) ): entropy of the remaining outcomes when X ≠ 1

Decomposability of Entropy Generalization
In general, we have that for any m between 1 and |X| − 1:

H(p) = H( Σ_{i=1}^{m} pi , Σ_{i=m+1}^{|X|} pi )
     + ( Σ_{i=1}^{m} pi ) H( p1 / Σ_{i=1}^{m} pi , ..., pm / Σ_{i=1}^{m} pi )
     + ( Σ_{i=m+1}^{|X|} pi ) H( p_{m+1} / Σ_{i=m+1}^{|X|} pi , ..., p|X| / Σ_{i=m+1}^{|X|} pi )

Apply this formula with m = 1, |X| = 3, p = (p1, p2, p3) = (1/2, 1/4, 1/4)
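As a sanity check, the decomposition can be verified numerically. The sketch below (helper names are mine; base-2 logarithms as in the lecture) applies the m = 1 case to p = (1/2, 1/4, 1/4) and recovers H(X) = 1.5 bits:

```python
import math

def entropy(p):
    """Entropy in bits; terms with zero probability contribute 0."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
m = 1

head = sum(p[:m])          # probability of the first group (here: X = 1)
tail = sum(p[m:])          # probability of the remaining outcomes

lhs = entropy(p)
rhs = (entropy([head, tail])
       + head * entropy([pi / head for pi in p[:m]])
       + tail * entropy([pi / tail for pi in p[m:]]))

print(lhs, rhs)  # both 1.5 bits
```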

1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information Definition
Joint and Conditional Mutual Information
4 Wrapping up

Entropy in Information Theory
If a random variable has distribution p, there exists an encoding with an average length of
(essentially) H(p) bits per outcome, and this is the "best" possible encoding
What happens if we use a “wrong” encoding?
e.g. because we make an incorrect assumption on the probability distribution
If the true distribution is p, but we build the code assuming it is q, it turns out we will need
H(p) + DKL(p∥q) bits on average
where DKL(p∥q) is some measure of “distance” between p and q

Relative Entropy
Definition
The relative entropy or Kullback-Leibler (KL) divergence between two probability distributions p(X ) and q(X ) is defined as:
DKL(p∥q) = Σ_{x∈X} p(x) ( log 1/q(x) − log 1/p(x) )
         = Σ_{x∈X} p(x) log( p(x)/q(x) ) = E_{p(X)}[ log( p(X)/q(X) ) ].
Both p(X) and q(X) are defined over the same alphabet X
Conventions on log likelihood ratio:
0 log(0/0) := 0,   0 log(0/q) := 0,   p log(p/0) := ∞
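A direct implementation of this definition, respecting the conventions above, might look like the following Python sketch (the function name and list-based input format are assumptions, not course code); the two prints at the end also illustrate that swapping the arguments changes the value:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same alphabet."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue              # 0 log(0/q) = 0 by convention
        if qx == 0:
            return math.inf       # p log(p/0) = infinity
        total += px * math.log2(px / qx)
    return total

print(kl_divergence([0.9, 0.1], [0.6, 0.4]))  # ≈ 0.326 bits
print(kl_divergence([0.6, 0.4], [0.9, 0.1]))  # ≈ 0.449 bits
```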

Relative Entropy Properties
DKL(p∥q) ≥ 0 (proof next lecture)
DKL(p∥q) = 0 ⇔ p = q (proof next lecture)
Not symmetric: in general DKL(p∥q) ≠ DKL(q∥p)
Does not satisfy the triangle inequality: DKL(p∥q) ≤ DKL(p∥r) + DKL(r∥q) can fail
Not a true distance, since it is not symmetric and does not satisfy the triangle inequality
Hence "KL divergence" rather than "KL distance"

Relative Entropy Uniform q
Let q correspond to the uniform distribution: q(x) = 1/|X|
Relative entropy between p and q:
DKL(p∥q) = Σ_{x∈X} p(x) log( p(x) / (1/|X|) )
         = Σ_{x∈X} p(x) ( log p(x) + log|X| )
         = −H(X) + Σ_{x∈X} p(x) log|X|
         = −H(X) + log|X|.
This matches the coding intuition: encoding with the (wrong) uniform distribution costs log|X| bits per outcome, a penalty of log|X| − H(X) = DKL(p∥q) bits over the optimal H(X).
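A quick numerical check of this identity, using helpers of the same kind as sketched earlier (again my own names, assuming base-2 logs):

```python
import math

def entropy(p):
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]                 # uniform over |X| = 3 outcomes

print(kl_divergence(p, q))          # ≈ 0.085 bits
print(math.log2(3) - entropy(p))    # same value: log|X| − H(X)
```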

Relative Entropy
Example (from Cover & Thomas, 2006)
Let X ∈ {0, 1} and consider the distributions p(X ) and q(X ) such that:
p(X = 1) = θp,  p(X = 0) = 1 − θp
q(X = 1) = θq,  q(X = 0) = 1 − θq
What distributions are these?
Compute DKL(p∥q) and DKL(q∥p) with θp = 1/2 and θq = 1/4

Relative Entropy
Example (from Cover & Thomas, 2006) — Cont’d
DKL(p∥q) = θp log(θp/θq) + (1 − θp) log( (1 − θp)/(1 − θq) )
         = (1/2) log( (1/2)/(1/4) ) + (1/2) log( (1/2)/(3/4) )
         = 1 − (1/2) log 3 ≈ 0.2075 bits

DKL(q∥p) = θq log(θq/θp) + (1 − θq) log( (1 − θq)/(1 − θp) )
         = (1/4) log( (1/4)/(1/2) ) + (3/4) log( (3/4)/(1/2) )
         = −1 + (3/4) log 3 ≈ 0.1887 bits
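These two values are easy to confirm numerically; the sketch below uses a kl_divergence helper like the one above (an assumption, not course code):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes no q(x) = 0 where p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]    # theta_p = 1/2: probabilities of X = 1 and X = 0 under p
q = [0.25, 0.75]  # theta_q = 1/4: probabilities of X = 1 and X = 0 under q

print(kl_divergence(p, q))  # ≈ 0.2075 bits
print(kl_divergence(q, p))  # ≈ 0.1887 bits
```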

1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information Definition
Joint and Conditional Mutual Information
4 Wrapping up

Mutual Information Definition
Let X , Y be two r.v. with joint distribution p(X , Y ) and marginals p(X ) and p(Y):
Definition
The mutual information I(X ; Y ) is the relative entropy between the joint distribution p(X , Y ) and the product distribution p(X )p(Y ):
I(X;Y) = DKL( p(X,Y) ∥ p(X)p(Y) ) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log( p(x,y) / (p(x)p(y)) )
Non-negativity: I(X;Y) ≥ 0
Symmetry: I(Y;X) = I(X;Y)
Intuitively, how much information, on average, X conveys about Y .

Relationship between Entropy and Mutual Information
We can re-write the definition of mutual information as:
I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log( p(x,y) / (p(x)p(y)) )
       = Σ_{x∈X} Σ_{y∈Y} p(x,y) log( p(x|y) / p(x) )
       = − Σ_{x∈X} log p(x) Σ_{y∈Y} p(x,y) − ( − Σ_{x∈X} Σ_{y∈Y} p(x,y) log p(x|y) )
       = H(X) − H(X|Y)
The average reduction in uncertainty of X due to the knowledge of Y.
Self-information: I(X;X) = H(X) − H(X|X) = H(X)

Mutual Information: Properties
Mutual Information is non-negative:
I(X;Y) ≥ 0
Mutual Information is symmetric:
I(X;Y) = I(Y;X)
Self-information:
I(X;X) = H(X)
Since H(X,Y) = H(Y) + H(X|Y), we have that:
I(X;Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)
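The identity I(X;Y) = H(X) + H(Y) − H(X,Y) gives a convenient way to compute mutual information from a joint probability table. The Python sketch below uses a made-up joint distribution and my own helper names, and cross-checks against the defining KL-divergence form:

```python
import math

def entropy(p):
    """Entropy in bits of a list of probabilities (zeros contribute 0)."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

# Hypothetical joint distribution p(x, y) over X, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

p_x = [sum(v for (x, _), v in joint.items() if x == i) for i in (0, 1)]
p_y = [sum(v for (_, y), v in joint.items() if y == j) for j in (0, 1)]

h_xy = entropy(list(joint.values()))
mi = entropy(p_x) + entropy(p_y) - h_xy   # I(X;Y) = H(X) + H(Y) - H(X,Y)

# Same value as D_KL( p(X,Y) || p(X)p(Y) ):
mi_kl = sum(v * math.log2(v / (p_x[x] * p_y[y]))
            for (x, y), v in joint.items() if v > 0)

print(mi, mi_kl)  # both ≈ 0.278 bits
```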

Breakdown of Joint Entropy
[Figure: diagram relating H(X,Y), H(X), H(Y), H(X|Y), H(Y|X) and I(X;Y). From Mackay, p. 140; see his exercise 8.8.]

Mutual Information Example 1 (from Mackay, 2003)
Let X,Y,Z be r.v. with X,Y ∈ {0,1}, X ⊥ Y and:
p(X = 0) = p,  p(X = 1) = 1 − p
p(Y = 0) = q,  p(Y = 1) = 1 − q
Z = (X + Y) mod 2
(a) If q = 1/2, what is P(Z = 0)? P(Z = 1)? I(Z; X)?
(b) For general p and q what is P(Z = 0)? P(Z = 1)? I(Z;X)?

Mutual Information
Example 1 (from Mackay, 2003) — Solution (a)
(a) As X ⊥ Y and q = 1/2, the "noise" Y flips the outcome of X with probability 1/2, regardless of the outcome of X. Therefore:
p(Z = 1) = 1/2,  p(Z = 0) = 1/2
Indeed, for q = 1/2 we see that Z ⊥ X, so:
I(X; Z) = H(Z) − H(Z|X) = 1 − 1 = 0

Mutual Information
Example 1 (from Mackay, 2003) — Solution (b)
l = p(Z = 0) = p(X = 0) × p(no flip) + p(X = 1) × p(flip)
             = pq + (1 − p)(1 − q) = 1 + 2pq − p − q
Similarly:
p(Z = 1) = p(X = 1) × p(no flip) + p(X = 0) × p(flip)
         = (1 − p)q + p(1 − q) = p + q − 2pq
Therefore:
I(Z; X) = H(Z) − H(Z|X) = H(l, 1 − l) − H(q, 1 − q)   (why? given X, Z is determined by Y alone, so H(Z|X) = H(Y) = H(q, 1 − q))
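To make this concrete, here is a small numerical check of I(Z;X) = H(l, 1 − l) − H(q, 1 − q) against a direct computation from the joint distribution of (X, Z); the particular values of p and q are arbitrary choices for illustration:

```python
import math

def h2(t):
    """Binary entropy H(t, 1 - t) in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

p, q = 0.3, 0.8                        # p(X = 0) = p, p(Y = 0) = q; arbitrary values

l = p * q + (1 - p) * (1 - q)          # p(Z = 0), as derived above
print(h2(l) - h2(q))                   # I(Z;X) = H(l, 1-l) - H(q, 1-q)

# Cross-check directly from the joint distribution of (X, Z):
p_x = [p, 1 - p]
p_z = [l, 1 - l]
joint = {(0, 0): p * q,             (0, 1): p * (1 - q),
         (1, 0): (1 - p) * (1 - q), (1, 1): (1 - p) * q}
print(sum(v * math.log2(v / (p_x[x] * p_z[z]))
          for (x, z), v in joint.items() if v > 0))   # same value, ≈ 0.236 bits here
```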

1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information Definition
Joint and Conditional Mutual Information
4 Wrapping up

Joint Mutual Information
Recall that for random variables X , Y ,
I (X ; Y ) = H (X ) − H (X |Y )
Reduction in uncertainty in X due to knowledge of Y
More generally, for random variables X1,…,Xn,Y1,…,Ym,
I(X1, ..., Xn; Y1, ..., Ym) = H(X1, ..., Xn) − H(X1, ..., Xn | Y1, ..., Ym)
Reduction in uncertainty in X1, ..., Xn due to knowledge of Y1, ..., Ym
Symmetry also generalises:
I(X1,…,Xn;Y1,…,Ym) = I(Y1,…,Ym;X1,…,Xn)

Conditional Mutual Information
The conditional mutual information between X and Y given Z = zk :
I(X; Y | Z = zk) = H(X | Z = zk) − H(X | Y, Z = zk)
Averaging over Z we obtain:
The conditional mutual information between X and Y given Z:
I(X; Y | Z) = H(X | Z) − H(X | Y, Z)
            = E_{p(X,Y,Z)}[ log( p(X,Y|Z) / (p(X|Z) p(Y|Z)) ) ]
The reduction in the uncertainty of X due to the knowledge of Y when Z is given.
Note that I(X;Y;Z), I(X|Y;Z) are illegal terms while e.g. I(A,B;C,D|E,F) is legal.
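As a sketch of how the conditional mutual information could be computed from a three-variable joint table, the code below averages over Z via the identity p(x,y|z) / (p(x|z) p(y|z)) = p(x,y,z) p(z) / (p(x,z) p(y,z)); the joint table and helper name are illustrative assumptions, not from the lecture:

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y, z) over binary X, Y, Z.
joint = {(x, y, z): 1 / 8 for x in (0, 1) for y in (0, 1) for z in (0, 1)}
joint[(0, 0, 0)] = 0.20   # shift mass so that X and Y are dependent given Z = 0
joint[(1, 1, 0)] = 0.05   # (probabilities still sum to 1)

def cond_mutual_information(joint):
    """I(X;Y|Z) in bits: sum of p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]."""
    p_z, p_xz, p_yz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), v in joint.items():
        p_z[z] += v
        p_xz[(x, z)] += v
        p_yz[(y, z)] += v
    return sum(v * math.log2(v * p_z[z] / (p_xz[(x, z)] * p_yz[(y, z)]))
               for (x, y, z), v in joint.items() if v > 0)

print(cond_mutual_information(joint))  # > 0: X tells us about Y once Z is observed
```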

1 Decomposability of Entropy
2 Relative Entropy / KL Divergence
3 Mutual Information Definition
Joint and Conditional Mutual Information
4 Wrapping up

Decomposability of entropy
Relative entropy
Mutual information
Reading: Mackay §2.5, Ch 8; Cover & Thomas §2.3 to §2.5

Next time:
Mutual information chain rule
Jensen's inequality
"Information cannot hurt"
Data processing inequality
