SECTION A.
Answer each of the following questions [ Marks per question as shown; 25% total ]
1. (4 points) Suppose X and Y are random variables with outcomes {0, 1, . . . , 10}. The
figure below shows a Hinton diagram of the joint distribution of X and Y, with the area
of the white square at cell (x, y) proportional to P(X = x, Y = y). (Wherever there is
no visible white square at cell (x, y), P(X = x, Y = y) is close to 0.)
Is it possible for X and Y to be statistically independent? Justify your answer.
No. If they were independent, we would have that p(Y = y|X = x) = p(Y = y). That
is, the distribution over Y would be the same for every X value. But this is evidently
not so from the diagram, e.g. comparing the first and second columns.
2. (6 points) Let X and Y be two random variables, both with possible outcomes {0, 1}.
For both parts below, if your answer is yes, write down the formula for computing P(X, Y)
from the given quantities; otherwise, give a counter-example.
(a) (2 pts) Does specifying P (X|Y ) and P (Y ) fully determine P (X, Y )?
Yes, because P(X, Y) = P(X|Y)P(Y) by the product rule (the definition of conditional probability).
(b) (4 pts) Does specifying P (X|Y ) and P (Y |X) fully determine P (X, Y )?
Yes, because if we know P(X|Y) and P(Y|X), we can work out P(Y) by inverting
Bayes' rule. In particular, for each x we have
p(X = x|Y = 1)/p(X = x|Y = 0) · p(Y = 1)/p(Y = 0) = p(Y = 1|X = x)/p(Y = 0|X = x),
so the odds p(Y = 1)/p(Y = 0), and hence P(Y), are determined by the given conditionals;
then P(X, Y) = P(X|Y)P(Y). Note that this crucially exploits the simple structure of
binary random variables.
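As an optional numerical check on part (b) (not part of the required working), here is a minimal Python sketch. The joint distribution P below is a hypothetical example, not taken from the question.

```python
import numpy as np

# Hypothetical joint distribution over binary X (rows) and Y (columns).
P = np.array([[0.1, 0.3],
              [0.2, 0.4]])

# Conditionals derived from the joint: columns of P(X|Y) sum to 1, rows of P(Y|X) sum to 1.
P_X_given_Y = P / P.sum(axis=0, keepdims=True)
P_Y_given_X = P / P.sum(axis=1, keepdims=True)

# Invert Bayes' rule (fixing x = 0) to recover the odds P(Y=1)/P(Y=0).
odds = (P_Y_given_X[0, 1] / P_Y_given_X[0, 0]) / (P_X_given_Y[0, 1] / P_X_given_Y[0, 0])
P_Y = np.array([1.0, odds]) / (1.0 + odds)   # (P(Y=0), P(Y=1))

# Reconstruct the joint as P(X,Y) = P(X|Y) P(Y) and confirm it matches the original.
assert np.allclose(P_X_given_Y * P_Y, P)
```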
3. (15 points) The Journal of Information Theory Research reviews all submitted papers
in the following process. An editor first assigns the paper to two reviewers, who each make
a recommendation of either acceptance (1) or rejection (0). Based on their reviews, the
editor makes the final decision. Even if the two reviews are consistent, the editor may still
make the opposite decision. Let Z be the editor's decision, and X and Y be the reviewers'
recommendations. Assume X and Y are independent and P(X = 1) = P(Y = 1) = 0.5.
The conditional probability P(Z = 0 | X, Y) is given by the following table:

P(Z = 0 | X, Y)    X = 0    X = 1
Y = 0               0.9      0.5
Y = 1               0.5      0.1

(a) (6 pts) Compute P(X = 1|Z = 0), showing all your working.
We have P(X = 1|Z = 0) = P(Z = 0|X = 1)P(X = 1)/P(Z = 0). Now
P(Z = 0|X = 1) = P(Z = 0|X = 1, Y = 0)P(Y = 0) + P(Z = 0|X = 1, Y = 1)P(Y = 1)
= (0.5)(0.5) + (0.1)(0.5) = 0.25 + 0.05 = 0.30. Similarly P(Z = 0|X = 0) =
(0.9)(0.5) + (0.5)(0.5) = 0.45 + 0.25 = 0.70. So P(Z = 0) = P(Z = 0|X = 1)P(X = 1) +
P(Z = 0|X = 0)P(X = 0) = 0.15 + 0.35 = 0.50. Thus P(X = 1|Z = 0) = (0.30)(0.5)/0.50 = 0.30.
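As an optional cross-check of this working, a short Python sketch using the table above:

```python
import numpy as np

# P(Z=0 | X, Y) from the table: rows are Y = 0, 1; columns are X = 0, 1.
p_z0_given_xy = np.array([[0.9, 0.5],
                          [0.5, 0.1]])
p_x = np.array([0.5, 0.5])
p_y = np.array([0.5, 0.5])

# P(X=x, Y=y, Z=0) = P(Z=0|x,y) P(x) P(y), using independence of X and Y.
joint_z0 = p_z0_given_xy * p_y[:, None] * p_x[None, :]

p_z0 = joint_z0.sum()                # P(Z=0) = 0.5
print(joint_z0[:, 1].sum() / p_z0)   # P(X=1 | Z=0) = 0.3
```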
(b) (2 pts) If from the above you find P (X = 1|Z = 0) > P (X = 1), explain intu-
itively why the probability increased; else, if you find P (X = 1|Z = 0) < P (X =
1), explain why the probability decreased.
It decreased. Intuitively, learning that the editor voted to reject makes it less likely that
X voted for acceptance, since X = 1 gives a smaller chance of leading to Z = 0 than X = 0
does (0.30 versus 0.70).
(c) (5 pts) Compute P (X = 1|Z = 0, Y = 1), showing all your working.
This is P(Z = 0|X = 1, Y = 1)P(X = 1|Y = 1)/P(Z = 0|Y = 1), where P(X = 1|Y = 1) =
P(X = 1) = 0.5 by independence. The denominator is P(Z = 0|X = 1, Y = 1)P(X = 1|Y = 1) +
P(Z = 0|X = 0, Y = 1)P(X = 0|Y = 1) = 0.05 + 0.25 = 0.30. Thus
P(X = 1|Z = 0, Y = 1) = (0.1)(0.5)/0.3 = 1/6 ≈ 0.167.
(d) (2 pts) If you find P (X = 1|Z = 0, Y = 1) > P (X = 1|Z = 0), explain intuitively
why the probability increased; else, explain why the probability decreased.
It decreased. If we know only that the editor voted to reject, we have some belief about
how likely it is that X voted for acceptance (a 30% chance, from (a)). If we further learn
that Y voted for acceptance, it becomes less likely that X also voted for acceptance,
since the editor only very rarely overturns two recommendations for acceptance, i.e.
P(Z = 0|X = 1, Y = 1) is very small.
SECTION B.
Answer each of the following questions [ Marks per question as shown; 25% total ]
1. (5 points) Let X, Y be two random variables with the following joint distribution:
P(X, Y)    X = 1    X = 2
Y = 1       1/4      0
Y = 2       1/4      1/2
(a) (3 pts) Compute H(Y |X).
We have p(X = 1) = 1/2 and p(X = 2) = 1/2. Also, p(Y = 1|X = 1) = 1/2
and p(Y = 1|X = 2) = 0. So H(Y|X) = 0.5 · (H2(1/2) + H2(0)) = 0.5 · (1 + 0) = 0.5 bits.
(b) (2 pts) Compute I(X;Y ). (You may express your answer in terms of log2 of an
appropriate integer.)
We have I(X;Y) = H(Y) − H(Y|X). Now p(Y = 1) = 1/4 + 0 = 1/4 and
p(Y = 2) = 1/4 + 1/2 = 3/4. So I(X;Y) = H2(1/4) − 0.5 = (2 − (3/4) log2 3) − 0.5 =
3/2 − (3/4) log2 3 ≈ 0.311 bits.
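As an optional cross-check of both parts, a short Python sketch computing H(Y|X) and I(X;Y) directly from the joint table:

```python
import numpy as np

# Joint distribution from the table: rows are Y = 1, 2; columns are X = 1, 2.
P = np.array([[0.25, 0.0],
              [0.25, 0.5]])

def H(p):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

p_x = P.sum(axis=0)   # (1/2, 1/2)
p_y = P.sum(axis=1)   # (1/4, 3/4)

H_Y_given_X = sum(p_x[i] * H(P[:, i] / p_x[i]) for i in range(2))
print(H_Y_given_X)              # 0.5
print(H(p_y) - H_Y_given_X)     # H2(1/4) - 0.5 ≈ 0.311
```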
2. (6 points) Let X1 and X2 be random variables with possible outcomes X . Suppose that
these variables are identically distributed, i.e. that P (X1 = x) = P (X2 = x) for all
x ∈ X . However, we do not assume X1 and X2 are independent. Now let
ρ = 1 − H(X2|X1)/H(X1).
(a) (2 pts) Prove that ρ = I(X1;X2)/H(X1).
Since X1 and X2 are identically distributed, H(X1) = H(X2). Hence
H(X1) − H(X2|X1) = H(X2) − H(X2|X1) = I(X1;X2), and dividing by H(X1) gives the result.
(b) (2 pts) Hence, or otherwise, prove that 0 ≤ ρ ≤ 1.
This is because 0 ≤ I(X1;X2) ≤ H(X1): mutual information is non-negative, and
I(X1;X2) = H(X2) − H(X2|X1) ≤ H(X2) = H(X1). Dividing by H(X1) gives 0 ≤ ρ ≤ 1.
(c) (1 pt) When is ρ = 0? Justify your answer.
This is when I(X1;X2) = 0, i.e. when the two variables are independent.
(d) (1 pt) When is ρ = 1? Justify your answer.
This is when I(X1;X2) = H(X1), i.e. when H(X2|X1) = 0, i.e. when X2 is completely
determined by X1 (for example, when the two variables are identical).
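A small Python sketch illustrating parts (a)–(d); the particular joint distribution for (X1, X2) below is an assumed example with identical marginals, not from the question.

```python
import numpy as np

def H(p):
    """Entropy in bits, with 0 log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical joint for (X1, X2); both marginals are (0.5, 0.5).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

p1 = P.sum(axis=1)                                                 # marginal of X1
H_X2_given_X1 = sum(p1[i] * H(P[i, :] / p1[i]) for i in range(2))
rho = 1 - H_X2_given_X1 / H(p1)
I = H(P.sum(axis=0)) - H_X2_given_X1                               # I(X1; X2)

print(rho, I / H(p1))   # the two agree, and rho lies in [0, 1]
```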
3. (6 points) The post office in Union Court of ANU handles 10,000 letters per day on
average.
(a) (3 pts) Use Markov’s inequality to derive an upper bound on the probability that
more than 12,000 letters will be handled tomorrow. (Leave your answer as a frac-
tion.)
This is P(X ≥ 12000) ≤ 10000/12000 = 5/6.
(b) (3 pts) Suppose the variance of letters processed is 2,000. Use Chebyshev’s in-
equality to derive an upper bound on the probability that this post office will handle
between 8,000 and 12,000 letters tomorrow. (Leave your answer as a fraction.)
By Chebyshev's inequality,
P(|X − 10000| ≥ 2000) ≤ 2000/2000² = 2000/4000000 = 1/2000.
The question then requires that we compute P(8000 < X < 12000) = P(|X − 10000| < 2000) =
1 − P(|X − 10000| ≥ 2000), which by the bound above is at least 1 − 1/2000 = 1999/2000.
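The bounds in (a) and (b) are simple enough to check in a few lines of Python:

```python
mean, var = 10_000, 2_000

markov_bound = mean / 12_000       # (a): P(X >= 12000) <= 5/6
cheb_tail = var / 2_000 ** 2       # (b): P(|X - 10000| >= 2000) <= 1/2000
print(markov_bound, cheb_tail, 1 - cheb_tail)   # 0.8333..., 0.0005, 0.9995
```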
4. (8 points) Recall that a 3-tuple of random variables (X, Y, Z) forms a Markov chain if and only if p(X, Y, Z) = p(X) p(Y | X) p(Z | Y). Similarly, a 4-tuple of random variables (X, Y, Z, W) forms a Markov chain if and only if p(X, Y, Z, W) = p(X) p(Y | X) p(Z | Y) p(W | Z). In what follows, suppose (X, Y, Z, W) forms a Markov chain.

(a) (2 pts) Prove that (X, Y, Z) forms a Markov chain.
We have
p(X, Y, Z) = ∑_w p(X, Y, Z, W = w)
           = ∑_w p(X) p(Y | X) p(Z | Y) p(W = w | Z)
           = p(X) p(Y | X) p(Z | Y) ∑_w p(W = w | Z)
           = p(X) p(Y | X) p(Z | Y).
This is precisely the definition of (X, Y, Z) being a Markov chain.

(b) (4 pts) Prove that (X, Z, W) forms a Markov chain.
Note that since (X, Y, Z) is a Markov chain, p(Z | Y) = p(Z | Y, X). Keeping this in mind, we have
p(X, Z, W) = ∑_y p(X, Y = y, Z, W)
           = ∑_y p(X) p(Y = y | X) p(Z | Y = y) p(W | Z)
           = p(X) p(W | Z) ∑_y p(Y = y | X) p(Z | Y = y)
           = p(X) p(W | Z) ∑_y p(Y = y | X) p(Z | Y = y, X)
           = p(X) p(W | Z) ∑_y p(Z, Y = y | X)
           = p(X) p(W | Z) p(Z | X).
This is precisely the definition of (X, Z, W) being a Markov chain.

(c) (2 pts) Prove that I(X;W) ≤ I(Y;Z). (You may use without proof the following result from Tutorial 5: if (X, Y, Z) forms a Markov chain, then (Z, Y, X) also forms a Markov chain, and so I(X;Z) ≤ min(I(X;Y), I(Y;Z)). You may also use the results of (a) and (b), even if you are unable to prove them.)
From (a), (X, Y, Z) is a Markov chain. From the statement in tutorials, I(X;Z) ≤ I(Y;Z).
From (b), (X, Z, W) is a Markov chain. From the statement in tutorials, I(X;W) ≤ I(X;Z).
Combining the two inequalities gives the result.

SECTION C.
Answer each of the following questions [ Marks per question as shown; 25% total ]

1. (5 pts) Suppose X is an ensemble over three possible outcomes, with probabilities pX = (0.4, 0.3, 0.3). Recall that X^N denotes an extended ensemble.

(a) (1 pt) Compute the raw bit content H0(X). (You may express your answer in terms of log2 of an appropriate integer.)
This is log2 3.

(b) (1 pt) What quantity, if any, does (1/N) H0(X^N) converge to as N → ∞? Justify your answer.
We have H0(X^N) = log2 |A_{X^N}| = log2 |A_X|^N = N log2 |A_X| = N H0(X). So (1/N) H0(X^N) stays at the constant value H0(X) = log2 3.

(c) (2 pts) Compute the essential bit content Hδ(X) when δ = 0.35.
If δ = 0.35 we want the smallest set with at least 0.65 of the probability mass. This is obtained by throwing out either element with smallest probability 0.3, leaving behind two elements. We thus find Hδ(X) = log2 2 = 1.

(d) (1 pt) What quantity, if any, does (1/N) Hδ(X^N) converge to as N → ∞ for δ = 0.35? Justify your answer.
The entropy H(X), by the source coding theorem.

2. (12 pts) Suppose X is an ensemble over five possible outcomes, with probabilities pX = (0.3, 0.3, 0.2, 0.1, 0.1).

(a) (4 pts) Compute a Huffman code C for X. Show all your working.
Initially we merge d and e to get a new symbol with probability 0.2. Then we merge "de" and c to get a new symbol with probability 0.4. Then we merge a and b to get a new symbol with probability 0.6. Finally we merge these two symbols. Thus a → 10, b → 11, c → 01, d → 000, e → 001.

(b) (2 pts) Compute the expected length L(C, X) for your Huffman code.
This is 2 · 2 · 0.3 + 2 · 0.2 + 2 · 3 · 0.1 = 1.2 + 0.4 + 0.6 = 2.2 bits.

(c) (2 pts) Explain the relationship between L(C, X) and the entropy H(X).
We know H(X) ≤ L(C, X), so H(X) ≤ 2.2. In fact H(X) = 2.1710.
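As an optional check on parts (a)–(c), the following Python sketch builds a Huffman code for pX and compares its expected length with the entropy. Huffman codes are not unique, so the codewords printed may differ from those above, but the expected length is still 2.2 bits.

```python
import heapq
from math import log2

p = {'a': 0.3, 'b': 0.3, 'c': 0.2, 'd': 0.1, 'e': 0.1}

# Repeatedly merge the two least probable symbols (the Huffman procedure).
# The counter breaks ties so the heap never compares trees directly.
heap = [(prob, i, sym) for i, (sym, prob) in enumerate(p.items())]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    p1, _, t1 = heapq.heappop(heap)
    p2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (p1 + p2, count, (t1, t2)))
    count += 1

def codewords(tree, prefix=''):
    """Read codewords off the merge tree: one branch gets 0, the other 1."""
    if isinstance(tree, str):
        return {tree: prefix or '0'}
    return {**codewords(tree[0], prefix + '0'), **codewords(tree[1], prefix + '1')}

code = codewords(heap[0][2])
L = sum(p[s] * len(code[s]) for s in p)      # expected length = 2.2 bits
H = -sum(q * log2(q) for q in p.values())    # entropy ≈ 2.171 bits
print(code, L, H)
```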
(d) (1 pt) Guffman, a self-trained mathematician, claims he has discovered a new prefix code C′ which has expected length L(C′, X) strictly smaller than L(C, X). By referencing an appropriate theorem from lectures, explain whether his claim is possible.
Not possible: Huffman codes have the shortest expected length of all prefix codes.

(e) (1 pt) Hoffman, a self-trained quant, claims that he has constructed a new prefix code C′′ which has codeword lengths (1, 1, 2, 2, 2). By referencing an appropriate theorem from lectures, explain whether his claim is possible.
Not possible: by the Kraft inequality, any prefix code must satisfy ∑_i 2^(−ℓ_i) ≤ 1, which is not the case here (the sum is 1/2 + 1/2 + 3 · 1/4 = 7/4 > 1).

(f) (2 pts) Suppose we compute a Shannon code C′′′ for X. Should we expect L(C′′′, X) = H(X)? Explain why or why not.
No. Because the probabilities are not powers of two, the codeword lengths ⌈log2 1/p(x)⌉ will be strictly larger than log2 1/p(x), so L(C′′′, X) > H(X).

3. (6 pts) Suppose X is an ensemble over outcomes {a, b, c, d}. Let the probabilities pX = (0.25, 0.5, 0.125, 0.125).

(a) (2 pts) Compute the codeword lengths for all outcomes under a Shannon-Fano-Elias code for X.
The lengths are ⌈log2 1/p(x)⌉ + 1 = (3, 2, 4, 4).

(b) (4 pts) Compute the Shannon-Fano-Elias codewords for a and b, showing all your working. (You do not need to compute codewords for c and d.)
We have F̄(a) = 0.125 = 0.001 in binary, and F̄(b) = 0.25 + 0.25 = 0.5 = 0.10 in binary. So a → 001, b → 10.

4. (2 pts) Briefly describe one potential advantage of arithmetic coding over Huffman coding.
It can adapt to changing probability distributions, and does not assume the probabilities stay the same at every single iteration.

SECTION D.
Answer each of the following questions [ Marks per question as shown; 25% total ]

1. (2 pts) Consider a binary symmetric channel with bit flip probability f = 0.25.

(a) (1 pt) Write down the transition matrix Q for this channel.

Q = [ 1−f    f  ]  =  [ 0.75  0.25 ]
    [  f   1−f  ]     [ 0.25  0.75 ]

(b) (1 pt) What input distribution achieves the capacity of the channel? (You may refer to a result from lectures.)
A uniform distribution, since the channel is symmetric.

2. (5 pts) Let Q denote some channel over input alphabet X = {0, 1} and output alphabet Y = {0, 1}.

(a) (1 pt) Donald computes the mutual information I(X;Y) using an input distribution pX = (0.5, 0.5). He finds that for this distribution, I(X;Y) = 0.9. Based on this fact, what is a lower bound on the capacity of Q? Justify your answer.
The capacity is the maximum of I(X;Y) over input distributions. So the capacity is at least 0.9 bits.

(b) (2 pts) Provide an upper bound on the capacity of Q. Justify your answer.
The capacity is the maximum of I(X;Y), which is at most H(Y), which is at most 1 bit.

(c) (2 pts) Carly claims she can construct a block code that achieves a rate of 1.8 bits per transmission, with arbitrarily small probability of block error. Is her claim possible? Explain why or why not.
No. By the noisy-channel coding theorem we cannot achieve rates above the capacity, and the capacity is at most 1 bit per transmission.

3. (14 pts) Consider a channel over inputs X = {a, b, c, d} and outputs Y = {a, b, c, d}, with transition matrix

Q = [ 1/2  1/2   0    0 ]
    [ 1/2  1/2   0    0 ]
    [  0    0    1    0 ]
    [  0    0    0    1 ]

and input distribution pX = (pa, pb, pc, pd).

(a) (2 pts) Is the channel Q symmetric? Explain why or why not.
No. We cannot partition the outputs such that, within each part, the rows are permutations of each other and the columns are permutations of each other. E.g. if we take the first two outputs, the rows are permutations of each other, but the columns are not.
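A short Python sketch that evaluates I(X;Y) for this channel for any input distribution; it can be used as a numerical cross-check of parts (b)–(e) below (for example, the second call attains log2 3 ≈ 1.585 bits).

```python
import numpy as np

# Transition matrix with Q[y, x] = P(Y = y | X = x). This particular Q happens to
# be symmetric as an array, so the row/column orientation does not change the numbers.
Q = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def H(p):
    """Entropy in bits, with 0 log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(p_x):
    p_x = np.asarray(p_x, dtype=float)
    p_y = Q @ p_x                                          # output distribution
    H_Y_given_X = sum(px * H(Q[:, i]) for i, px in enumerate(p_x))
    return H(p_y) - H_Y_given_X

print(mutual_information([0.25, 0.25, 0.25, 0.25]))   # 1.5 bits (sub-optimal input)
print(mutual_information([1/6, 1/6, 1/3, 1/3]))       # log2(3) ≈ 1.585 bits
```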
(b) (3 pts) Compute the probability distribution p(Y) over outputs. (Express your answer in terms of pa, pb, pc, pd.)
First, p(Y = a) = p(Y = b) = (pa + pb)/2. Next, p(Y = c) = pc and p(Y = d) = pd.

(c) (3 pts) Compute the conditional entropy H(Y | X). (Express your answer in terms of pa, pb, pc, pd.)
First, H(Y|X = a) = H(Y|X = b) = H2(1/2) = 1. Next, H(Y|X = c) = H(Y|X = d) = 0. So H(Y|X) = pa + pb.

(d) (3 pts) Hence, show that
I(X;Y) = −(1 − pc − pd) · log2(1 − pc − pd) − pc · log2 pc − pd · log2 pd.
We know I(X;Y) = H(Y) − H(Y|X). First,
H(Y) = −(pa + pb) · log2((pa + pb)/2) − pc log2 pc − pd log2 pd
     = −(pa + pb) · log2(pa + pb) + (pa + pb) − pc log2 pc − pd log2 pd.
So
I(X;Y) = −(pa + pb) · log2(pa + pb) − pc log2 pc − pd log2 pd.
Now clearly pa + pb = 1 − pc − pd, so the result follows.

(e) (3 pts) What input distribution achieves the capacity of Q? Explain your answer intuitively.
The above is just the standard entropy of a random variable with distribution (pc, pd, pe), where pe = 1 − pc − pd = pa + pb. So it must be maximised by a uniform distribution over these outcomes, i.e. with pc = pd = pe = 1/3. Note here that the precise choice of pa and pb does not matter: picking pc = pd = 1/3 constrains pa + pb to be 1/3, but otherwise the precise values are arbitrary. This makes sense, because outcomes a and b are essentially interchangeable, and can be considered as one new symbol.

4. (4 pts) Suppose we use a (7, 4) Hamming code to communicate over a binary symmetric channel with nonzero flip probability.

(a) (3 pts) Compute the three parity bits for the message 1001. You may use a diagram to show your working.
The parity bits are 110.

(b) (1 pt) Suppose a receiver sees the bit string 1001b1b2b3, where b1b2b3 is the parity bit string you computed above. Is it guaranteed that the sender actually transmitted 1001? Explain why or why not.
No, it is not guaranteed. There could have been three or more bit flips starting from another codeword.
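For question 4(a), a minimal Python sketch of the parity computation. It assumes the parity assignment used in lectures (MacKay's convention: t5 = s1⊕s2⊕s3, t6 = s2⊕s3⊕s4, t7 = s1⊕s3⊕s4); other orderings of the parity checks are possible.

```python
import numpy as np

# Each row selects the source bits that feed one parity bit (assumed convention).
parity_checks = np.array([[1, 1, 1, 0],   # t5 = s1 + s2 + s3
                          [0, 1, 1, 1],   # t6 = s2 + s3 + s4
                          [1, 0, 1, 1]])  # t7 = s1 + s3 + s4

s = np.array([1, 0, 0, 1])                # the message 1001
t = parity_checks @ s % 2                 # -> [1, 1, 0]
print(''.join(map(str, t)))               # "110"
```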