COMP2610 / COMP6261 — Information Theory:
Assignment 1
Robert C. Williamson
Out: 29 August 2018 Due: 5pm, 21 September 2018
Instructions
This assignment contributes 20% to your final score for the course.
Marks: The maximum marks available are given in square brackets. Each question is worth 25
marks in total. For questions where you are asked to prove results, if you cannot prove an
earlier part, you can still attempt subsequent parts of the question assuming the truth
of the earlier part.
COMP2610 students: Answer questions 1-4 only. You are not expected to answer
question 5.
COMP6261 students: Answer questions 4 and 5 plus [two questions chosen from Q1,
Q2, or Q3]. You will only get marks for 4 questions (two of which must be Q4 and Q5).
Submission: You should submit a paper copy of your assignment to the assignment submission
box on the ground floor of the CSIT building by the due date. It is not necessary to type
your answers, but handwritten submissions must be clearly legible.
You must include your name, UNI-ID and your Tutor group clearly at the top of the first
page of your submission.
You must show all your working: it is not sufficient to merely write down an answer!
Cheating and Plagiarism: All assignments must be done individually. Plagiarism is a
university offence and will be dealt with according to university procedures:
http://academichonesty.anu.edu.au/UniPolicy.html.
1. In the game of Gah! Bad DECAF! a player draws letter tiles out of a bag. Each of the 16
tiles in the bag has a letter l ∈ {A, B, C, D, E, F, G, H} on one side and a value v ∈ {1, 3, 5}
on the other side. The number of tiles for each letter that appear in the bag, and the
associated value for each letter are shown in the table below:
Letter A B C D E F G H
Count 4 2 1 2 4 1 1 1
Value 1 5 3 3 1 5 3 5
Let d ∈ {yes, no} indicate whether a tile is a DECAF tile or not. That is, d = yes if
l ∈ {D, E, C, A, F} and d = no otherwise. The uppercase symbols L, V, and D will be
used to respectively denote the ensembles associated with the letter, value, and DECAF
status of a tile randomly drawn from the bag. All tiles in a bag are equally likely to be
chosen when a draw is made.
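If you wish to check your working numerically, the table above can be encoded as a small joint distribution. The following Python sketch is purely illustrative; the dictionary names (tiles, p_lv, p_l, p_v, p_d) are not part of the question.

    from fractions import Fraction

    # Tile table: letter -> (count, value); there are 16 tiles in total.
    tiles = {'A': (4, 1), 'B': (2, 5), 'C': (1, 3), 'D': (2, 3),
             'E': (4, 1), 'F': (1, 5), 'G': (1, 3), 'H': (1, 5)}
    total = sum(count for count, _ in tiles.values())   # = 16

    # Joint distribution p(l, v): a (letter, value) pair occurs count/16 of the time.
    p_lv = {(l, v): Fraction(count, total) for l, (count, v) in tiles.items()}

    # Marginals p(l), p(v) and the DECAF indicator p(d).
    p_l = {l: Fraction(count, total) for l, (count, _) in tiles.items()}
    p_v = {}
    for (l, v), pr in p_lv.items():
        p_v[v] = p_v.get(v, Fraction(0)) + pr
    p_d = {'yes': sum(pr for l, pr in p_l.items() if l in set('DECAF')),
           'no':  sum(pr for l, pr in p_l.items() if l not in set('DECAF'))}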
(a) Suppose a single tile is drawn from the bag. Calculate:
i. The information in seeing an A. That is, compute h(l = A). [2 points]
ii. The information in seeing a value of 3. That is, h(v = 3). [2 points]
iii. The conditional information h(l = A | v = 1) of flipping a tile over and seeing an
A given that a tile with value 1 was drawn. [2 points]
iv. The conditional information h(v = 1|l = A) of flipping a tile over and seeing a
1 given that an A tile was drawn. Give an intuitive explanation for this answer.
[2 points]
v. The entropies H(L), H(V) and H(D). [2 points]
(b) Calculate the mutual information I(L; V) between the letter and value of a tile, and
the mutual information I(D; V) between the DECAF status and value of a tile. Using
the data processing inequality, explain the relationship between these two quantities.
[5 points]
(c) Exactly half of all people who play Gah! Bad DECAF! cheat by secretly throwing
out all their As and Es, leaving behind a bag with only 8 of the 16 original tiles. We
will write c = unfair if a person cheats and c = fair otherwise and let C be the
associated ensemble with p(fair) = p(unfair) = 0.5.
i. What is the conditional entropy H(D |C)? [2 points]
ii. Use the previous result to determine H(C,D). [3 points]
iii. Suppose you are playing against a randomly chosen opponent and she plays a
DECAF tile (i.e., d = yes) three times in a row, returning the tile to her bag
bag after each play. What probability would you assign to her being a cheat?
[5 points]
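For checking the information content, entropy and mutual information calculations by hand, a possible set of helper functions is sketched below; the function names are illustrative and the small example at the end is only there to show usage.

    import math

    def h(p):
        """Shannon information content h = -log2(p) of an outcome with probability p."""
        return -math.log2(p)

    def entropy(dist):
        """Entropy H = sum_x p(x) log2(1/p(x)) of a distribution given as {outcome: probability}."""
        return sum(p * h(p) for p in dist.values() if p > 0)

    def mutual_information(joint):
        """I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ] for a joint {(x, y): probability}."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0) + p
            py[y] = py.get(y, 0) + p
        return sum(p * math.log2(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)

    # Usage example: a fair coin X and an identical copy Y of it; I(X;Y) should be 1 bit.
    print(entropy({'heads': 0.5, 'tails': 0.5}))
    print(mutual_information({('heads', 'heads'): 0.5, ('tails', 'tails'): 0.5}))

Passing the dictionaries from the earlier sketch into these functions gives a numerical check on parts (a) and (b); your written answer must still show the working.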
2. Let Y be a random variable with possible outcomes {0, 1}, and p(Y = 1) = 1/2. Let X be a
random variable with possible outcomes X = {a, b, c}. Define
p = (p(X = a|Y = 1), p(X = b|Y = 1), p(X = c|Y = 1))
q = (p(X = a|Y = 0), p(X = b|Y = 0), p(X = c|Y = 0)).
(a) Suppose that
p = (1/2, 1/4, 1/4)   and   q = (1/4, 1/2, 1/4).
Compute I(X;Y ). [3 points]
(b) Using the definition of mutual information, show that for any choice of p, q,
I(X; Y) = ( DKL(p || m) + DKL(q || m) ) / 2,
where m = (p + q)/2. [10 points]
(c) Let Z be a random variable with outcomes x ∈ X, such that for all x ∈ X and
y ∈ {0, 1}, p(Z = x |Y = y) = p(X = x |Y = 1 − y). Using part (b) compute I(Z;Y ).
Explain intuitively the relation of your answer to part (a). [4 points]
(d) Suppose p and q are as in part (a). Using part (b), or otherwise, give an example of a
random variable Z with possible outcomes X satisfying I(X;Y ) > I(Z;Y ). Explain
your answer in terms of the data-processing inequality. [4 points]
(e) Suppose p and q are as in part (a). Give an example of a random variable Z with
possible outcomes x ∈ X satisfying I(X;Y ) < I(Z;Y ). Explain your answer in
terms of the data-processing inequality. [4 points]
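For those who want to sanity-check parts (a) and (b) numerically, the sketch below computes I(X; Y) both from the definition and from the expression in part (b). It assumes numpy and is not a substitute for the requested derivation; the variable names are illustrative.

    import numpy as np

    def kl(p, q):
        """D_KL(p || q) in bits, assuming q > 0 wherever p > 0."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

    # Conditionals p = p(X|Y=1) and q = p(X|Y=0) from part (a), with p(Y=0) = p(Y=1) = 1/2.
    p = np.array([0.5, 0.25, 0.25])
    q = np.array([0.25, 0.5, 0.25])
    m = (p + q) / 2                                  # the marginal p(X)

    # I(X;Y) directly from the definition sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ].
    joint = np.vstack([0.5 * q, 0.5 * p])            # row y = 0, row y = 1; columns x = a, b, c
    py = joint.sum(axis=1, keepdims=True)
    px = joint.sum(axis=0, keepdims=True)
    I_def = float(np.sum(joint * np.log2(joint / (py * px))))

    # The same quantity via the identity of part (b).
    I_kl = (kl(p, m) + kl(q, m)) / 2
    print(I_def, I_kl)                               # the two values should agree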
3. [25 points] Suppose a music collection consists of 4 albums: the album Alina has 5
tracks; the album Bends has 12; the album Coieda has 15; and the album Debut has 12.
(a) How many bits would be required to uniformly code:
i. The index of all the albums? (We are not encoding the actual music, but merely
the titles — the metadata). Give an example uniform code for the albums.
[2 points]
ii. Only the tracks in the album Alina? Give an example of a uniform code for the
tracks assuming they are named “Track 1”, “Track 2”, etc. [2 points]
iii. All the tracks in the music collection? [2 points]
(b) What is the raw bit content required to distinguish all the tracks in the collection?
[2 points]
(c) Suppose every track in the music collection has an equal probability of being selected.
Let A denote the album title of a randomly selected track from the collection.
i. Write down the ensemble for A — that is, its alphabet and probabilities.
[2 points]
ii. What is the raw bit content of A^4? [2 points]
iii. What is the smallest value of δ such that the smallest δ-sufficient subset of A^4
contains fewer than 256 elements? [3 points]
iv. What is the largest value of δ such that the essential bit content Hδ(A^4) is strictly
greater than zero? [4 points]
(d) Suppose the album titles ensemble A is as in part (c).
i. Compute an approximate value for the entropy H(A) to two decimal places (you
may use a computer or calculator to obtain the approximation but write out the
expression you are approximating). [2 points]
ii. Approximately how many elements are in the typical set T_Nβ for A when N = 100
and β = 0.1? [2 points]
iii. Is it possible to design a uniform code to send large blocks of album titles with
a 95% reliability using at most 1.5 bits per title? Explain why or why not.
[2 points]
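Part (d)(i) allows a computer for the numerical value. One possible sketch is given below; the album counts come from the question, and the rough typical-set size uses the 2^(NH) approximation.

    import math

    # Ensemble A: the album of a uniformly chosen track; 5 + 12 + 15 + 12 = 44 tracks in total.
    counts = {'Alina': 5, 'Bends': 12, 'Coieda': 15, 'Debut': 12}
    n_tracks = sum(counts.values())
    probs = {album: c / n_tracks for album, c in counts.items()}

    # H(A) = sum_a p(a) log2(1/p(a)); write this expression out explicitly in your answer.
    H_A = sum(p * math.log2(1 / p) for p in probs.values())
    print(round(H_A, 2))

    # The typical set T_Nβ for blocks of N titles contains roughly 2^(N H(A)) elements.
    N = 100
    print(2 ** (N * H_A))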
4. Let X be a real-valued random variable with mean µ and standard deviation σ < ∞. The
median of X is the real number m ∈ R satisfying
p(X ≥ m) = p(X ≤ m) = 1/2.
If g is some function on R, then arg min_x g(x) is the value x* that minimises g(x): that is,
g(x*) ≤ g(x) for all x, and min_x g(x) = g(x*). The argmin may not be unique, and so we
consider it to be set-valued and thus write x* ∈ arg min_x g(x).
(a) Prove that
m ∈ arg min_{c ∈ R} E(|X − c|).
(Hint: break the expectation into two conditional expectations.) [5 points]
(b) Prove that
|µ − m| ≤ σ.
(Hint: The function x ↦ |x| is convex.) [5 points]
(c) The α-quantile of X is the real number qα which satisfies¹
p(X ≤ qα) = α.
For τ ∈ (0, 1), define the pinball loss ℓτ : R → R via
ℓτ(x) = τx if x ≥ 0, and ℓτ(x) = (τ − 1)x if x < 0.
Show that for any α ∈ (0, 1),
qα ∈ arg min_{c ∈ R} E(ℓα(X − c)).    (1)
[5 points]
(d) One can show that
µ ∈ arg min_{c ∈ R} E((X − c)²)    (2)
and given (2), by substitution we have
σ² = min_{c ∈ R} E((X − c)²).
In light of this, and of part (c), for α ∈ (0, 1), give an interpretation of
Qα = min_{c ∈ R} E(ℓα(X − c)).
Argue that, like σ², for α ∈ (0, 1), Qα measures the deviation or variability of X.
Explain why Qα(X) = 0 only when X is constant. What advantages might Qα have
over σ² as a measure of variability? [5 points]
¹ The quantile is not necessarily unique. Observe that q1/2 = m.
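Parts (a), (c) and (d) can be explored empirically before attempting the proofs. The sketch below is illustrative only (the Gaussian sample and the grid are arbitrary choices, not part of the question): it estimates arg min_c E ℓα(X − c) on a grid and compares it with the empirical α-quantile, and the minimum value is an estimate of Qα.

    import numpy as np

    def pinball(tau, x):
        """Pinball loss: tau*x for x >= 0 and (tau - 1)*x for x < 0."""
        return np.where(x >= 0, tau * x, (tau - 1) * x)

    rng = np.random.default_rng(0)
    X = rng.normal(loc=1.0, scale=2.0, size=100_000)   # any distribution with finite variance will do

    alpha = 0.25
    cs = np.linspace(-6, 6, 1201)
    risk = np.array([pinball(alpha, X - c).mean() for c in cs])   # estimates of E l_alpha(X - c)

    print(cs[risk.argmin()], np.quantile(X, alpha))    # grid minimiser vs empirical alpha-quantile
    print(risk.min())                                  # estimate of Q_alpha, a pinball measure of spread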
(e) (Don’t panic!) A property T of a distribution P is a real number that summarises
some aspect of the distribution. One often needs to simplify from grappling with an
entire distribution to a single number summary of the distribution. Means, variances,
and quantiles are all properties, as is the entropy since we can just as well consider
the entropy of a random variable X as a property of its distribution P (think of the
definition) and thus write H(P). Expressions such as (1) and (2) are examples of
eliciting a property of a distribution. In general² one might have a function S (called
a scoring function) such that for some desired property T of a distribution P one has
T(P) ∈ arg min_{c ∈ R} E_{Y∼P} S(c, Y),    (3)
where Y ∼ P means that Y is a random variable with distribution P. It turns out³
that not all properties T can be elicited; that is, there is no S such that T(P) can be
written in the form (3). For example, the variance cannot be elicited. A necessary
(and sufficient) condition for a property to be elicitable is that for arbitrary P0, P1
such that T(P0) = T(P1) we have that for all α ∈ (0, 1),
T((1 − α)P0 + αP1) = T(P0).
The distribution (1 − α)P0 + αP1 is called a mixture of P0 and P1.
Construct an example P0, P1 such that H(P0) = H(P1) but for some α ∈ (0, 1), the
entropy of the mixture distribution satisfies
H((1 − α)P0 + αP1) ≠ H(P0)
to prove (using the above fact) that one cannot elicit entropy — that is, there can be
no equation of the form (3) which yields the entropy H(P).
(Hint: Easier than it might look. Start simple!) [5 points]
² In (1) we have S(c, x) = |x − c| and in (2) we have S(c, x) = (x − c)².
³ This is a non-trivial result, which requires several additional technicalities to state precisely. It is proved in
[Ingo Steinwart, Chloé Pasin, Robert C. Williamson, and Siyu Zhang, “Elicitation and identification of properties.”
In JMLR: Workshop and Conference Proceedings, 35 (Conference on Learning Theory), pp. 1–45. 2014].
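If you would like to experiment numerically before choosing P0 and P1 in part (e), a helper along the following lines may be useful. The placeholder distributions below are deliberately identical, so they do not constitute an answer; the task is to find P0 ≠ P1 with H(P0) = H(P1) whose mixture has a different entropy.

    import numpy as np

    def entropy(p):
        """Entropy in bits of a finite distribution given as a probability vector."""
        p = np.asarray(p, float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def mixture_entropy(p0, p1, alpha):
        """H((1 - alpha) P0 + alpha P1) for two distributions on the same finite alphabet."""
        p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
        return entropy((1 - alpha) * p0 + alpha * p1)

    # Placeholders only: with P0 = P1 all three numbers trivially coincide.
    P0 = [1/3, 1/3, 1/3]
    P1 = [1/3, 1/3, 1/3]
    print(entropy(P0), entropy(P1), mixture_entropy(P0, P1, 0.5))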
5. Suppose X is a real-valued random variable with µ = E(X) = 0 and p(X ∈ [a, b]) = 1.
(a) Show that for any t > 0,
E(e^{tX}) ≤ e^{g(t(b−a))},
where g(u) = −γu + log(1 − γ + γe^u), with γ = −a/(b − a).
(Hint: write X as a convex combination of a and b, where the convex weighting
parameter depends upon X. Exploit the convexity of the function x ↦ e^{tx} and
the fact that inequalities are preserved upon taking expectations of both sides, since
expectations are integrals.) [5 points]
(b) By using Taylor's theorem, show that for all u > 0, g(u) ≤ u²/8, and hence (taking
u = t(b − a)) that E(e^{tX}) ≤ e^{t²(b−a)²/8}.
Furthermore, suppose E(X) = µ ≠ 0. Show that
E(e^{tX}) ≤ e^{tµ} e^{t²(b−a)²/8}.
[5 points]
(c) Show that for any random variable X and any ε > 0,
p(X > ε) ≤ inf_{t ≥ 0} e^{−tε} E(e^{tX}).
(Hint: Rewrite using ∀t > 0 and use Markov's inequality.) [5 points]
(d) Assume X1, . . . , Xn are independent random variables with p(Xi ∈ [a, b]) = 1 for
i ∈ {1, . . . , n} and E(Xi) = µ for i ∈ {1, . . . , n}. Let X̄n = (1/n) ∑_{i=1}^n Xi be the empirical
mean. Show that for any ε ∈ (0, 1),
p(|X̄n − µ| ≥ ε) ≤ 2 e^{−2nε²/(b−a)²}.
[5 points]
(e) Suppose X1, . . . , Xn are iid Bernoulli random variables with parameter θ ∈ (0, 1).
The above result implies that
p(|X̄n − θ| > ε) ≤ 2 e^{−2nε²}.
Compare this bound with what one obtains using the Chebyshev and Markov
inequalities. For different θ, plot the three bounds for varying n (a possible starting
point is sketched below). For ε = 0.1, what value n0 does each of the three bounds
tell you is needed so that for all n ≥ n0, p(|X̄n − θ| > 0.1) ≤ 0.1? [5 points]
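A possible starting point for the plots in part (e) is sketched below. The Hoeffding curve is the bound above; the precise forms of the Chebyshev and Markov bounds are left to you, and the ones used here (variance-based Chebyshev for the two-sided event, Markov applied directly to X̄n for the one-sided event) are only one reasonable reading. The grid of n values and the function name three_bounds are illustrative choices.

    import numpy as np
    import matplotlib.pyplot as plt

    eps = 0.1
    n = np.arange(1, 2001)

    def three_bounds(theta, n, eps):
        hoeffding = 2 * np.exp(-2 * n * eps ** 2)         # bound above on p(|X̄n - θ| > ε)
        chebyshev = theta * (1 - theta) / (n * eps ** 2)  # Var(X̄n)/ε² for Bernoulli(θ)
        markov = np.full(n.shape, theta / (theta + eps))  # p(X̄n ≥ θ + ε) ≤ E(X̄n)/(θ + ε); one-sided, n-independent
        return hoeffding, chebyshev, markov

    for theta in (0.1, 0.5, 0.9):
        plt.figure()
        for vals, label in zip(three_bounds(theta, n, eps),
                               ('Hoeffding', 'Chebyshev', 'Markov (one-sided)')):
            plt.semilogy(n, np.minimum(vals, 1.0), label=label)   # probabilities never exceed 1
        plt.axhline(0.1, linestyle='--')                          # the target level 0.1
        plt.xlabel('n'); plt.ylabel('bound')
        plt.title('theta = %.1f, eps = %.1f' % (theta, eps))
        plt.legend()
    plt.show()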