COMP2610 / COMP6261 — Information Theory: Assignment 1
Instructions
This assignment contributes 20% to your final score for the course.
Marks: The maximum marks available are given in square brackets. Each question is worth 25 marks in total. For questions where you are asked to prove results, if you cannot prove an earlier part, you may still attempt subsequent parts of the question assuming the truth of the earlier part.
COMP2610 students: Answer questions 1-4 only. You are not expected to answer question 5.
COMP6261 students: Answer questions 4 and 5, plus two questions chosen from Q1, Q2, and Q3. You will only receive marks for 4 questions (two of which must be Q4 and Q5).
Submission: You should submit a paper copy of your assignment to the assignment submission box on the ground floor of the CSIT building by the due date. It is not necessary to type your answers, but handwritten answers must be clearly legible.
You must include your name, UNI-ID and your Tutor group clearly at the top of the first page of your submission.
You must show all your working: it is not sufficient to merely write down an answer!
Cheating and Plagiarism: All assignments must be done individually. Plagiarism is a university offence and will be dealt with according to university procedures: http://academichonesty.anu.edu.au/UniPolicy.html.
1. In the game of Gah! Bad DECAF! a player draws letter tiles out of a bag. Each of the 16 tiles in the bag has a letter l ∈ {A, B, C, D, E, F, G, H} on one side and a value v ∈ {1, 3, 5} on the other side. The number of tiles for each letter that appear in the bag, and the associated value for each letter, are shown in the table below:
Letter   A   B   C   D   E   F   G   H
Count    4   2   1   2   4   1   1   1
Value    1   5   3   3   1   5   3   5
Let d ∈ {yes, no} indicate whether a tile is a DECAF tile or not: d = yes if l ∈ {D, E, C, A, F}, and d = no otherwise. The uppercase symbols L, V, and D will be used to denote the ensembles associated with, respectively, the letter, value, and DECAF status of a tile randomly drawn from the bag. All tiles in a bag are equally likely to be chosen when a draw is made.
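For sanity-checking the calculations in parts (a) and (b) below, the three ensembles can be built directly from the table. A minimal Python sketch (the variable names are illustrative, not part of the question):

```python
from math import log2

# Tile table: letter -> (count, value); 16 tiles in total.
tiles = {'A': (4, 1), 'B': (2, 5), 'C': (1, 3), 'D': (2, 3),
         'E': (4, 1), 'F': (1, 5), 'G': (1, 3), 'H': (1, 5)}
N = sum(c for c, _ in tiles.values())

p_L = {l: c / N for l, (c, _) in tiles.items()}            # marginal p(l)
p_V = {}
for l, (c, v) in tiles.items():
    p_V[v] = p_V.get(v, 0.0) + c / N                        # marginal p(v)
p_D = {'yes': sum(p_L[l] for l in 'DECAF'),                 # marginal p(d)
       'no':  sum(p_L[l] for l in 'BGH')}

def h(p):
    """Shannon information content h(x) = -log2 p(x), in bits."""
    return -log2(p)

def H(dist):
    """Entropy H = sum_x p(x) h(x) of a {outcome: probability} dict."""
    return sum(p * h(p) for p in dist.values() if p > 0)

print(h(p_L['A']), H(p_L), H(p_V), H(p_D))
```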
- (a) Suppose a single tile is drawn from the bag. Calculate:
- The information in seeing an A. That is, compute h(l = A). [2 points]
- The information in seeing a value of 3. That is, h(v = 3). [2 points]
- The conditional information h(l = A | v = 1) of flipping a tile over and seeing an A, given that a tile with value 1 was drawn. [2 points]
- The conditional information h(v = 1|l = A) of flipping a tile over and seeing a 1 given that an A tile was drawn. Give an intuitive explanation for this answer. [2 points]
- The entropies H(L), H(V) and H(D). [2 points]
- (b) Calculate the mutual information I(L;V) between the letter and value of a tile, and the mutual information I(D;V) between the DECAF status and value of a tile. Using the data processing inequality, explain the relationship between these two quantities. [5 points]
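Since d is a deterministic function of l, a single helper that computes mutual information from a joint distribution can check both quantities numerically. A minimal sketch (the helper and names are our own, not part of the question):

```python
from math import log2

tiles = {'A': (4, 1), 'B': (2, 5), 'C': (1, 3), 'D': (2, 3),
         'E': (4, 1), 'F': (1, 5), 'G': (1, 3), 'H': (1, 5)}

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Joint of (letter, value): each letter contributes count/16 to one cell.
joint_LV = {(l, v): c / 16 for l, (c, v) in tiles.items()}

# Joint of (DECAF status, value), obtained by merging letters.
joint_DV = {}
for l, (c, v) in tiles.items():
    d = 'yes' if l in 'DECAF' else 'no'
    joint_DV[(d, v)] = joint_DV.get((d, v), 0.0) + c / 16

print(mutual_information(joint_LV), mutual_information(joint_DV))
```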
- (c) Exactly half of all people who play Gah! Bad DECAF! cheat by secretly throwing out all their As and Es, leaving behind a bag with only 8 of the 16 original tiles. We will write c = unfair if a person cheats and c = fair otherwise and let C be the associated ensemble with p(fair) = p(unfair) = 0.5.
- What is the conditional entropy H(D|C)? [2 points]
- Use the previous result to determine H(C, D). [3 points]
- Suppose you were playing against a randomly chosen opponent and she plays a DECAF letter (i.e., d = yes) three times in a row, replacing her tile in her bag after each play. What probability would you assign to her being a cheat? [5 points]
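The draws are with replacement, so they are conditionally independent given C, and Bayes' rule applies directly. A small sketch of the update (the fair bag has 12/16 DECAF tiles; the cheat's bag, with the As and Es removed, has 4 DECAF tiles out of 8; function names are illustrative):

```python
# Bayes update for the cheating question.
p_decaf = {'fair': 12 / 16, 'unfair': 4 / 8}
prior = {'fair': 0.5, 'unfair': 0.5}

def posterior_unfair(n):
    """p(c = unfair | n DECAF plays in a row), draws with replacement."""
    likelihood = {c: p_decaf[c] ** n for c in prior}
    evidence = sum(prior[c] * likelihood[c] for c in prior)
    return prior['unfair'] * likelihood['unfair'] / evidence

print(posterior_unfair(3))
```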
2. Let Y be a random variable with possible outcomes {0, 1}, and p(Y = 1) = 1/2. Let X be a random variable with possible outcomes X = {a, b, c}. Define

   p = (p(X = a | Y = 1), p(X = b | Y = 1), p(X = c | Y = 1))
   q = (p(X = a | Y = 0), p(X = b | Y = 0), p(X = c | Y = 0)).

- (a) Suppose that

   p = (1/2, 1/4, 1/4)
   q = (1/4, 1/2, 1/4).

Compute I(X;Y). [3 points]

- (b) Using the definition of mutual information, show that for any choice of p, q,

   I(X;Y) = (1/2) DKL(p||m) + (1/2) DKL(q||m),

where m = (p + q)/2. [10 points]
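The identity in (b) says that when p(Y = 1) = 1/2, I(X;Y) equals the Jensen-Shannon divergence between p and q. A quick numerical check of (a) against this identity (illustrative code, not a proof):

```python
from math import log2

def dkl(p, q):
    """D_KL(p || q) in bits, for distributions given as sequences."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = (1/2, 1/4, 1/4)   # p(X = . | Y = 1)
q = (1/4, 1/2, 1/4)   # p(X = . | Y = 0)
m = tuple((pi + qi) / 2 for pi, qi in zip(p, q))

# I(X;Y) computed directly from the joint p(x, y) = p(x|y) p(y).
joint = {}
for cond, y in ((p, 1), (q, 0)):
    for x, px_given_y in zip('abc', cond):
        joint[(x, y)] = 0.5 * px_given_y

marg_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in 'abc'}
I_direct = sum(pxy * log2(pxy / (marg_x[x] * 0.5))
               for (x, y), pxy in joint.items() if pxy > 0)

I_identity = 0.5 * dkl(p, m) + 0.5 * dkl(q, m)
print(I_direct, I_identity)  # the two should coincide
```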
- (c) Let Z be a random variable with outcomes x ∈ X, such that for all x ∈ X and y ∈ {0, 1}, p(Z = x | Y = y) = p(X = x | Y = 1 − y). Using part (b), compute I(Z;Y). Explain intuitively the relation of your answer to part (a). [4 points]
- (d) Suppose p and q are as in part (a). Using part (b), or otherwise, give an example of a random variable Z with possible outcomes X satisfying I(X;Y) > I(Z;Y). Explain your answer in terms of the data-processing inequality. [4 points]
- (e) Suppose p and q are as in part (a). Give an example of a random variable Z with possible outcomes x ∈ X satisfying I(X;Y) < I(Z;Y). Explain your answer in terms of the data-processing inequality. [4 points]
3. [25 points] Suppose a music collection consists of 4 albums: the album Alina has 5 tracks; the album Bends has 12; the album Coieda has 15; and the album Debut has 12.
- (a) How many bits would be required to uniformly code:
- The index of all the albums? (We are not encoding the actual music, but merely the titles — the metadata). Give an example uniform code for the albums. [2 points]
- Only the tracks in the album Alina? Give an example of a uniform code for the tracks, assuming they are named “Track 1”, “Track 2”, etc. [2 points]
- All the tracks in the music collection? [2 points]
- (b) What is the raw bit content required to distinguish all the tracks in the collection? [2 points]
- (c) Suppose every track in the music collection has an equal probability of being selected. Let A denote the album title of a randomly selected track from the collection.
- Write down the ensemble for A — that is, its alphabet and probabilities. [2 points]
- What is the raw bit content of A^4? [2 points]
- What is the smallest value of δ such that the smallest δ-sufficient subset of A^4 contains fewer than 256 elements? [3 points]
- What is the largest value of δ such that the essential bit content H_δ(A^4) is strictly greater than zero? [4 points] (A small enumeration sketch for these δ questions follows this list.)
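A sketch of the enumeration behind the two δ questions: list all 4^4 = 256 outcomes of A^4 in decreasing probability and count how many are needed to capture probability at least 1 − δ. The probabilities come from part (c); the helper name is our own:

```python
from itertools import product
from math import prod

# Ensemble A: probability each album is the source of a uniformly chosen track.
pA = {'Alina': 5/44, 'Bends': 12/44, 'Coieda': 15/44, 'Debut': 12/44}

# All 256 outcome probabilities of A^4, most probable first.
outcomes = sorted((prod(pA[a] for a in x) for x in product(pA, repeat=4)),
                  reverse=True)

def smallest_sufficient_size(delta):
    """Fewest outcomes whose total probability is at least 1 - delta."""
    total, size = 0.0, 0
    for prob in outcomes:
        if total >= 1 - delta:
            break
        total += prob
        size += 1
    return size

for delta in (0.0, 0.01, 0.05, 0.1, 0.25):
    print(delta, smallest_sufficient_size(delta))
```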
- (d) Suppose the album titles ensemble A is as in part (c).
- Compute an approximate value for the entropy H(A) to two decimal places. (You may use a computer or calculator to obtain the approximation, but write out the expression you are approximating; a short sketch follows this list.) [2 points]
- Approximately how many elements are in the typical set T_{Nβ} for A when N = 100 and β = 0.1? [2 points]
- Is it possible to design a uniform code to send large blocks of album titles with a 95% reliability using at most 1.5 bits per title? Explain why or why not. [2 points]
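Since part (d) permits a computer for the approximation, here is a minimal sketch of the expression being approximated, together with the standard AEP estimate |T_{Nβ}| ≈ 2^{NH(A)} (the estimate is a heuristic, not a substitute for the justification the question asks for):

```python
from math import log2

# p(A): probability each album is the source of a uniformly chosen track.
pA = {'Alina': 5/44, 'Bends': 12/44, 'Coieda': 15/44, 'Debut': 12/44}

# H(A) = -sum_a p(a) log2 p(a)
H_A = -sum(p * log2(p) for p in pA.values())
print(f"H(A) = {H_A:.2f} bits")

# AEP: for large N the typical set has roughly 2^(N * H(A)) elements.
N = 100
print(f"|T_Nbeta| is roughly 2^{N * H_A:.1f}")
```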
4. Let X be a real-valued random variable with mean μ and standard deviation σ < ∞. The median of X is the real number m ∈ R satisfying
   p(X ≥ m) = p(X ≤ m) = 1/2.
If g is some function on R, then arg min_x g(x) is the value x* that minimises g(x): that is, g(x*) ≤ g(x) for all x, and min_x g(x) = g(x*). The arg min may not be unique, and so we consider it to be set-valued and thus write x* ∈ arg min_x g(x).
- (a) Prove that

   m ∈ arg min_{c∈R} E(|X − c|).    (1)

(Hint: break the expectation into two conditional expectations.) [5 points]

- (b) Prove that

   |μ − m| ≤ σ.

(Hint: the function x ↦ |x| is convex.) [5 points]

- (c) The α-quantile of X is the real number q_α which satisfies¹ p(X ≤ q_α) = α. For τ ∈ (0, 1), define the pinball loss l_τ : R → R via

   l_τ(x) = τx if x ≥ 0, and l_τ(x) = (τ − 1)x if x < 0.

Show that for any α ∈ (0, 1),

   q_α ∈ arg min_{c∈R} E(l_α(X − c)). [5 points]

- (d) One can show that

   μ ∈ arg min_{c∈R} E((X − c)^2),    (2)

and given (2), by substitution we have

   σ^2 = min_{c∈R} E((X − c)^2).

In light of this, and of part (c), for α ∈ (0, 1), give an interpretation of

   Q_α(X) = min_{c∈R} E(l_α(X − c)).

Argue that, like σ^2, for α ∈ (0, 1), Q_α(X) measures the deviation or variability of X. Explain why Q_α(X) = 0 only when X is constant. What advantages might Q_α have over σ^2 as a measure of variability? [5 points]
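A Monte Carlo illustration of the minimisation in part (c): for a sample from a distribution of our choosing (Exp(1) here, purely for illustration), the empirical minimiser of the pinball risk should land near the α-quantile. The grid and sample size are arbitrary choices:

```python
import random

def pinball(tau, x):
    """The pinball loss l_tau(x)."""
    return tau * x if x >= 0 else (tau - 1) * x

random.seed(0)
sample = [random.expovariate(1.0) for _ in range(20_000)]
alpha = 0.75

def empirical_risk(c):
    """Sample average of l_alpha(X - c)."""
    return sum(pinball(alpha, x - c) for x in sample) / len(sample)

grid = [i / 50 for i in range(0, 201)]        # candidate c values in [0, 4]
c_star = min(grid, key=empirical_risk)
q_alpha = sorted(sample)[int(alpha * len(sample))]
print(c_star, q_alpha)   # both should be near ln 4 ~ 1.386 for Exp(1)
```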
¹ The quantile is not necessarily unique. Observe that q_{1/2} = m.
- (e) (Don’t panic!) A property T of a distribution P is a real number that summarises some aspect of the distribution. One often needs to simplify from grappling with an entire distribution to working with a single-number summary of it. Means, variances, and quantiles are all properties, as is the entropy, since we can just as well consider the entropy of a random variable X as a property of its distribution P (think of the definition), and thus write H(P). Expressions such as (1) and (2) are examples of eliciting a property of a distribution. In general² one might have a function S (called a scoring function) such that for some desired property T of a distribution P one has
   T(P) ∈ arg min_{c∈R} E_{Y∼P} S(c, Y),    (3)
where Y ∼ P means that Y is a random variable with distribution P. It turns out³ that not all properties T can be elicited; that is, for some T there is no S such that T(P) can be written in the form (3). For example, the variance cannot be elicited. A necessary (and sufficient) condition for a property to be elicitable is that for arbitrary P0, P1 such that T(P0) = T(P1), we have that for all α ∈ (0, 1),

   T((1 − α)P0 + αP1) = T(P0).
The distribution (1 − α)P0 + αP1 is called a mixture of P0 and P1.
Construct an example P0, P1 such that H(P0) = H(P1) but, for some α ∈ (0, 1), the entropy of the mixture distribution satisfies

   H((1 − α)P0 + αP1) ≠ H(P0),

to prove (using the above fact) that one cannot elicit entropy — that is, there can be no equation of the form (3) which yields the entropy H(P).
(Hint: Easier than it might look. Start simple!) [5 points]
² In (1) we have S(c, x) = |x − c| and in (2) we have S(c, x) = (x − c)^2.
³ This is a non-trivial result, which requires several additional technicalities to state precisely. It is proved in [Ingo Steinwart, Chloé Pasin, Robert C. Williamson, and Siyu Zhang, “Elicitation and identification of properties.” In JMLR: Workshop and Conference Proceedings, 35 (Conference on Learning Theory), pp. 1–45, 2014].
5. Suppose X is a real-valued random variable taking values in the interval [a, b] (that is, p(X ∈ [a, b]) = 1), with μ = E(X) = 0.
- (a) Show that for any t > 0,

   E(e^{tX}) ≤ e^{g(t(b−a))},

where g(u) = −γu + log(1 − γ + γe^u), with γ = −a/(b − a). (Hint: write X as a convex combination of a and b, where the convex weighting parameter depends upon X. Exploit the convexity of the function x ↦ e^{tx} and the fact that inequalities are preserved upon taking expectations of both sides, since expectations are integrals.) [5 points]
- (b) By using Taylor’s theorem, show that for all u > 0, g(u) ≤ u^2/8, and hence

   E(e^{tX}) ≤ e^{t^2(b−a)^2/8}.

Furthermore, suppose E(X) = μ ≠ 0. Show that

   E(e^{tX}) ≤ e^{tμ} e^{t^2(b−a)^2/8}. [5 points]

- (c) Show that for any random variable X and any ε > 0,

   p(X > ε) ≤ inf_{t≥0} e^{−tε} E(e^{tX}).

(Hint: the bound holds for every t ≥ 0; rewrite the event {X > ε} in terms of e^{tX} and use Markov’s inequality.) [5 points]
- (d) Assume X1, …, Xn are independent random variables with p(Xi ∈ [a, b]) = 1 and E(Xi) = μ for all i ∈ {1, …, n}. Let X̄n = (1/n) Σ_{i=1}^n Xi be the empirical mean. Show that for any ε ∈ (0, 1),

   p(|X̄n − μ| ≥ ε) ≤ 2e^{−2nε^2/(b−a)^2}. [5 points]
- (e) Suppose X1, …, Xn are i.i.d. Bernoulli random variables with parameter θ ∈ (0, 1). The above result implies that

   p(|X̄n − θ| > ε) ≤ 2e^{−2nε^2}.

Compare this bound with what one obtains using the Chebyshev and Markov inequalities. For different θ, plot the three bounds for varying n. For ε = 0.1, what value n0 does each of the three bounds tell you is necessary so that, for all n ≥ n0,

   p(|X̄n − θ| > 0.1) ≤ 0.1? [5 points]
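A sketch of the comparison asked for in (e). One common reading applies Markov’s inequality to the nonnegative variable X̄n itself, giving the n-independent bound p(X̄n ≥ θ + ε) ≤ θ/(θ + ε); other readings are possible, so treat the Markov curve below as one choice among several:

```python
import numpy as np
import matplotlib.pyplot as plt

eps = 0.1
ns = np.arange(1, 1001)

for theta in (0.1, 0.5):
    hoeffding = 2 * np.exp(-2 * ns * eps**2)
    chebyshev = theta * (1 - theta) / (ns * eps**2)    # Var(X_bar)/eps^2
    markov = np.full(ns.shape, theta / (theta + eps))  # n-independent
    plt.plot(ns, np.minimum(hoeffding, 1), label=f"Hoeffding, theta={theta}")
    plt.plot(ns, np.minimum(chebyshev, 1), label=f"Chebyshev, theta={theta}")
    plt.plot(ns, markov, label=f"Markov, theta={theta}")

plt.xlabel("n"); plt.ylabel("bound on p(|X_bar - theta| > 0.1)")
plt.yscale("log"); plt.legend(); plt.show()

# Smallest n with the Hoeffding bound <= 0.1: solve 2 exp(-2 n eps^2) <= 0.1.
n0 = int(np.ceil(np.log(2 / 0.1) / (2 * eps**2)))
print("Hoeffding n0:", n0)   # ceil(ln 20 / 0.02) = 150
```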