Prof. Mark Bun
CAS CS 591 B: Communication Complexity
Lecture Notes 7: Introduction to Information Complexity
Fall 2019
Reading.
• Rao-Yehudayoff Chapter 6
1 A Few More Facts about Mutual Information
Lemma 1.
I(A; B) = I(A; B|C) − I(A; C|B) + I(A; C) Proof. By applying the chain rule two different ways,
I(A; BC) = I(A; C) + I(A; B|C) = I(A; B) + I(A; C|B). Rearranging gives the identity.
The following claim states that post-processing cannot increase the amount of information one random variable reveals about another. (Think, e.g., of C = f(B) for a randomized function f, where the randomness in f is independent of everything else.)
Lemma 2 (Information Processing Inequality). Let A → B → C be a Markov chain, i.e., C is independent from A conditioned on B. Then I(A; C) ≤ I(A; B).
Proof.
I(A; C) ≤ I(A; C) + I(A; B|C) = I(A; BC) = I(A; B) + I(A; C|B) = I(A; B), where the last equality follows because I(A; C|B) = 0 by definition.
2 KL Divergence
Mutual information gives us a way of measuring how far two random variables are from being independent. A related way to measure similarity between two distributions is via the notion of KL divergence, or relative entropy.
Let A,B be random variables over the same sample space X, with probability mass functions p and q, respectively. Then we define
Definition 3 (KL Divergence).
D(A∥B) = p(x) log p(x) = Ex∼A log p(x)
x∈X :p(x)>0
q(x) q(x) 1
Like entropy, KL divergence has an intuitive interpretation in terms of coding messages. We may write
D(A∥B)=Ex∼Alog 1 −Ex∼Alog 1 . q(x) p(x)
The second term is simply the entropy of A, i.e., the minimum expected length for an encoding of A. The first term we can think of as the expected length of an encoding of A using a code which was optimized for B. So KL divergence captures the loss of using a code designed for B rather than a code designed for A.
Let’s record a few facts about KL divergence:
• D(A∥A) = 0
• D(A∥B) ≥ 0. This follows by Jensen:
p(x) p(x) −Ex∼A log q(x) ≤ − log Ex∼A q(x)
q(x) = log Ex∼A p(x)
= log 1 = 0.
• Unlike mutual information, D(A∥B) ̸= D(B∥A) in general
• If A is supported on a point outside of the support of B, then D(A∥B) = ∞
KL divergence also has a nice connection to mutual information. For jointly distributed random variables A, B, let A ⊗ B denote the product distribution with marginals A and B.
Theorem 4. I(A;B)=D(AB∥A⊗B) Proof.
D(AB∥A⊗B)=p(x,y)log p(x,y) x,y p(x)p(y)
111
We are now ready to use information theory to define notions of information cost for communication protocols. Let μ be a distribution over X × Y and let Π be a communication protocol using private randomness RA and RB for Alice and Bob respectively, and public randomness R. Define the tran- script of the protocol, denoted T(x,y,rA,rB,r), to consist of the public randomness string followed by the sequence of messages exchanged between Alice and Bob. Let AB be jointly distributed over X × Y according to μ.
The first way one might model the information cost of a protocol is in terms of how much information about Alice and Bob’s inputs is revealed to an external observer.
2
p(x,y)log p(x) + p(y) − p(x,y)
=
= H(A) + H(B) − H(AB) = I(A; B).
x,y
3 Information Cost
Definition 5. The external information cost of a protocol Π with respect to a distribution μ, denoted ICext(Π) = I(AB;T).
It turns out that there is another way to model information cost which is easier to work with. This is the notion of internal information cost, or the amount Alice and Bob learn about each others’ inputs over the course of the protocol.
Definition 6. The (internal) information cost of a protocol Π with respect to a distribution μ, denoted ICμ(Π) = I(A; T |B) + I(B; T |A).
You may (very reasonably) ask why the conditioning does not include Alice and Bob’s private randomness. This is because conditioning on private randomness does not affect the information cost:
Lemma 7. I(A;T|BRB) = I(A;T|B)
Proof. We prove the claim by induction on the number of rounds of the protocol. Let T≤k denote the prefix of the transcript through round k. The claim is clearly true for T0, which consists only of the public randomness. Suppose it is Bob’s turn to speak in round k and the claim is true through round k − 2, so our induction hypothesis says
I(A;T≤k−1|BRB) = I(A;T≤k−1|B). By applying the chain rule twice,
I(A; T≤k|B) = I(A; T≤k−1|B) + I(A; Tk|BT≤k−1)
= I(A; T≤k−2|B) + I(A; Tk−1|BT≤k−2) + I(A; Tk|BT≤k−1)
Similarly, we can write
I(A;T≤k|BRB) = I(A;T≤k−2|BRB) + I(A;Tk−1|BRBT≤k−2) + I(A;Tk|BRBT≤k−1).
The third term of each identity is zero, since it’s Bob’s turn to speak in round k: Alice’s input does not affect Tk except through T≤k−1 which is already being conditioned on. By the induction hypothesis, it’s enough to show that
I(A;Tk−1|BT≤k−2) = I(A;Tk−1|BRBT≤k−2). To see this, we use Lemma 1 to write
I(A;Tk−1|BT≤k−2) = I(A;Tk−1|BT≤k−2RB) − I(RB;Tk−1|BT≤k−2A) + I(RB;Tk−1|BT≤k−2). The last two terms are zero because it is Alice’s turn to speak in round k−1; hence, Bob’s randomness
RB does not affect Tk−1 except through T≤k−2.
4 Information vs. Communication
We establish the following relationships showing that information lower bounds communication. Let CCμ(Π) denote the expected number of bits exchanged under Π for inputs chosen from μ.
3
μ
Theorem 8. For every distribution μ,
IC (Π) ≤ ICext(Π) ≤ CC (Π).
μμμ
Proof. We begin with the first inequality. By repeatedly applying the chain rule, we get L
ICext(Π) = I(AB;T) = I(AB;T |T ),
k=1
where L is the length of the protocol. By the chain rule and non-negativity of mutual information,
μ
k ≤k−1
we have
Now observe that if it is Alice’s turn to speak in round k, then I(B;Tk|AT≤k−1) = 0. Similarly, if
it’s Bob’s turn to speak, then I(A;Tk|BT≤k−1) = 0. So we actually have I(AB; Tk|T≤k−1) ≥ I(A; Tk|BT≤k−1) + I(B; Tk|AT≤k−1).
Repeatedly applying the chain rule again lets us conclude
L
ICext(Π) = I(AB; T |T )
k=1 L
≥ I(A; Tk|BT≤k−1) + I(B; Tk|AT≤k−1) k=1
= I(A; T |B) + I(B; T |A) = ICμ(Π).
I(AB; Tk|T≤k−1) ≥ max {I(A; Tk|BT≤k−1), I(B; Tk|AT≤k−1)} .
μ
k ≤k−1
Next, to bound the external information by the communication, we simply observe:
ICext(Π)=I(Π ;AB)+I(Π ;AB|Π )=I(Π ;AB|Π )≤H(Π ), μ 0 >0 0 >0 0 >0
and the latter is at most the average total length of messages sent between Alice and Bob.
4