COMP2610/6261

SECTION A.
Answer each of the following questions [ Marks per question as shown; 25% total ]
1. (8 pts) A scientist conducts 1000 trials of an experiment involving variables X, Y, each with possible values {0, 1}. He records the outcomes of the trials in the following table.
Counts    X = 0    X = 1
Y = 0        50      400
Y = 1       500       50
Compute each of the following, showing all your working.
(a) (1 pt) p(X = 1, Y = 1).
By definition, p(X = 1, Y = 1) = 50/1000 = 0.05.
(b) (2 pt) p(X = 1).
By the sum rule, p(X = 1) = p(X = 1, Y = 1) + p(X = 1, Y = 0) = 450/1000 = 0.45.
(c) (1 pt) E[X].
By definition of expectation, E[X] = p(X = 1) · 1 + p(X = 0) · 0 = p(X = 1) = 0.45.
(d) (2 pt) p(Y = 1|X = 1).
By definition of conditional probability, p(Y = 1 | X = 1) = p(X = 1, Y = 1) / p(X = 1) = 50/450 = 1/9.
(e) (2 pt) p(X + Y = 1).
We have p(X + Y = 1) = p(X = 0, Y = 1) + p(X = 1, Y = 0) = 900/1000 = 0.9.
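As a sanity check, the values above can be recomputed directly from the counts table with a minimal Python sketch:

counts = {  # counts[(x, y)] from the table above, out of 1000 trials
    (0, 0): 50, (1, 0): 400,
    (0, 1): 500, (1, 1): 50,
}
n = sum(counts.values())                          # 1000
p = {xy: c / n for xy, c in counts.items()}       # joint distribution p(X = x, Y = y)

p_x1_y1 = p[(1, 1)]                               # (a) 0.05
p_x1 = p[(1, 0)] + p[(1, 1)]                      # (b) 0.45
e_x = p_x1                                        # (c) E[X] = p(X = 1) for a {0, 1} variable
p_y1_given_x1 = p[(1, 1)] / p_x1                  # (d) 1/9
p_sum_is_1 = p[(0, 1)] + p[(1, 0)]                # (e) 0.9
print(p_x1_y1, p_x1, e_x, p_y1_given_x1, p_sum_is_1)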
2. (9 pts) Lorelai is taking an exam that involves multiple choice questions, with available choices {1, 2, . . . , c}, where c ≥ 2. For each question, there is only one correct answer, and Lorelai is allowed to provide only one choice as her answer.
With 90% probability, Lorelai knows the answer for a question. When Lorelai knows the answer for a question, she always picks the correct choice. When Lorelai does not know the answer for a question, she makes a guess. When making a guess, Lorelai is equally likely to pick any of the c available choices.
(a) (2 pt) Show that the probability Lorelai does not know the answer for a question and guesses the correct answer is 0.1/c.
Let A ∈ {0, 1} denote whether the question is answered correctly, and K ∈ {0, 1} denote whether the student knew the answer. From the information provided,
p(A = 1 | K = 0) = 1/c    p(K = 1) = 0.9.
Therefore,
p(A = 1, K = 0) = p(A = 1 | K = 0) · p(K = 0) = (1/c) · (1 − 0.9) = 0.1/c.
(b) (2 pt) Show that the probability Lorelai answers a question correctly is 0.9 + 0.1/c.
p(A = 1) = p(A = 1|K = 1)p(K = 1) + p(A = 1|K = 0)p(K = 0)
= 1 · 0.9 + (1/c) · (1 − 0.9) = 0.9 + 0.1/c.
(c) (3 pt) Suppose Lorelai answers a question correctly. Show that the probability she knew the answer is 9c / (9c + 1).
We are interested in p(K = 1|A = 1). Using Bayes’ rule, and part (b), we get
p(K = 1 | A = 1) = p(A = 1 | K = 1) · p(K = 1) / p(A = 1)
= 0.9 / (0.9 + 0.1/c)
= 9 / (9 + 1/c)
= 9c / (9c + 1).
(d) (2 pt) When c gets larger, does the probability in part (c) get larger or smaller? Explain your answer intuitively in one or two sentences.
It gets larger. When c is large, a correct guess becomes very unlikely, so a correct answer is much more likely to have come from Lorelai actually knowing the answer.
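A minimal numerical sketch of parts (b)-(d), using the given values p(K = 1) = 0.9 and a uniform guess over c choices:

def p_correct(c):
    # part (b): p(A = 1) = 0.9 + 0.1 / c
    return 0.9 + 0.1 / c

def p_knew_given_correct(c):
    # Bayes' rule as in part (c); algebraically equal to 9c / (9c + 1)
    return (1.0 * 0.9) / p_correct(c)

for c in (2, 4, 10, 100):
    print(c, round(p_knew_given_correct(c), 4), round(9 * c / (9 * c + 1), 4))
# The two columns agree, and the probability increases towards 1 as c grows (part (d)).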
3. (8 pts)
(a) (2 pt) In one or two sentences, explain what the likelihood function for a parameter is.
The likelihood function is the density (or mass) function of the observed data, regarded as a function of the parameter θ ∈ Θ, the set of parameters to be estimated. Concretely, if the sample x = (x1, x2, ..., xn) is drawn according to one member of a family of distributions Pθ, the likelihood is the following function of θ:
L(θ|x) = f (x|θ).
(b) (2 pt) In one or two sentences, explain what the maximum likelihood estimate for a parameter is.
The maximum likelihood estimate is the value of the parameter that maximises the likelihood of the observed data,
θ̂(x) = arg max_θ L(θ | x).
(c) (2 pt) Suppose we have a coin which comes up heads with unknown probability p. We perform 5 independent flips of the coin, and observe the outcomes
{heads, heads, tails, heads, heads}.
Based on this data, what is the maximum likelihood estimate for the probability p?
The MLE for n Bernoulli trials is p̂ = (1/n) · Σ_{i=1}^{n} X_i, that is, the fraction of successes. In this case we have p̂ = 4/5.
(d) (2 pt) Suppose our prior belief in the coin coming up heads follows a Beta(a, b) distribution for certain constants a, b. In one or two sentences, describe a procedure which can additionally incorporate this prior belief to estimate the probability p. (You do not need to compute the estimate for p using this procedure.)
We could use a Bayesian method, which computes the posterior distribution of the parameter p, Pr(p | Data) ∝ L(p | Data) · Pr(p). We could in particular take the mode of the posterior distribution, giving the maximum a posteriori (MAP) estimate. In the case of a Bernoulli random variable with a Beta prior, this corresponds to adding "fake" occurrences of heads and tails and then computing the fraction of heads as usual.
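As an illustration of this procedure, here is a minimal Python sketch computing the MLE and a Beta-prior MAP estimate for the coin data; the Beta(2, 2) prior is a hypothetical choice, since the question does not fix a and b:

flips = [1, 1, 0, 1, 1]                 # heads = 1, tails = 0
n, heads = len(flips), sum(flips)

p_mle = heads / n                       # maximum likelihood estimate: 4/5
a, b = 2, 2                             # hypothetical Beta(a, b) prior parameters
p_map = (heads + a - 1) / (n + a + b - 2)   # mode of the posterior Beta(a + heads, b + tails)
print(p_mle, p_map)                     # 0.8 and 5/7 ≈ 0.714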
SECTION B.
Answer each of the following questions [ Marks per question as shown; 25% total ]
4. (9 pts) Suppose X represents whether it is sunny or cloudy in Canberra, and Y represents whether it is warm or humid in Canberra. The joint distribution p(X,Y) is
p(X = sunny, Y = warm) = 1/2
p(X = sunny, Y = humid) = 1/4
p(X = cloudy, Y = warm) = 1/4
p(X = cloudy, Y = humid) = 0.
(a) (2 pt) In one or two sentences, explain what the entropy of a random variable represents.
It is the inherent uncertainty in the random variable. The closer to a uniform distribution, the higher the uncertainty.
(b) (3 pt) Compute the entropy H(X).
We have p(X = sunny) = 1/2 + 1/4 = 3/4. So, the entropy is
H(X) = H2(3/4)
= (3/4) · log2(4/3) + (1/4) · log2 4
= log2 4 − (3/4) · log2 3
= 2 − (3/4) · log2 3.
(c) (2 pt) Compute the joint entropy H(X,Y).
H(X,Y) = −( (1/2) · log2(1/2) + (1/4) · log2(1/4) + (1/4) · log2(1/4) + 0 · log2 0 ) = 1.5.
(d) (2 pt) Compute the conditional entropy H(Y|X). We have
H(Y|X) = H(X,Y) − H(X)
= 1.5 − 2 + 0.75 · log2 3
= 0.75 · log2 3 − 0.5.
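As a sanity check, the three entropies can be recomputed numerically from the joint distribution (a minimal sketch, using base-2 logarithms and the convention 0 · log 0 = 0):

import math

joint = {("sunny", "warm"): 0.5, ("sunny", "humid"): 0.25,
         ("cloudy", "warm"): 0.25, ("cloudy", "humid"): 0.0}

def entropy(probs):
    # entropy in bits, skipping zero-probability terms (0 * log 0 = 0)
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p          # marginal p(X)

H_X = entropy(p_x.values())               # 2 - (3/4) log2 3 ≈ 0.811
H_XY = entropy(joint.values())            # 1.5
H_Y_given_X = H_XY - H_X                  # 0.75 log2 3 - 0.5 ≈ 0.689
print(H_X, H_XY, H_Y_given_X)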
5. (12 pts) Let X and Y be independent random variables with possible outcomes {0, 1}, with
p(X = 1) = 1/2    p(Y = 1) = 1/2.
Let Z = X +Y.
(a) (1 pt) Compute I(X;Y).
As the variables are independent, I(X;Y) = 0.
(b) (4 pt) Show that
p(X = 1 | Z = 0) = 0
p(X = 1 | Z = 1) = 1/2
p(X = 1 | Z = 2) = 1.
Consider the following possibilities:

X  Y  Z
0  0  0
0  1  1
1  0  1
1  1  2

Therefore:
p(X = 1 | Z = 0) = 0
p(X = 1 | Z = 2) = 1
p(X = 1 | Z = 1) = p(X = 1, Z = 1) / p(Z = 1)
= p(X = 1, Y = 0) / (p(X = 1, Y = 0) + p(X = 0, Y = 1))
= 1/2.
(c) (3 pt) Compute H(X|Z).
H(X|Z) = Σ_{z ∈ {0,1,2}} p(Z = z) · H(X|Z = z)
= p(Z = 1) · H(X|Z = 1)
= p(Z = 1) · H2(1/2)
= (p(X = 0,Y = 1) + p(X = 1,Y = 0)) · 1
= 1/2.
(d) (2 pt) Compute I(X; Z).
I(X; Z) = H(X) − H(X|Z) = 1 − 1/2 = 1/2.
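A minimal sketch that recomputes H(X|Z) and I(X; Z) by enumerating the joint distribution of (X, Z):

import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint_xz = defaultdict(float)
for x in (0, 1):
    for y in (0, 1):
        joint_xz[(x, x + y)] += 0.25          # p(X = x, Y = y) = 1/4 by independence

p_z = defaultdict(float)
for (x, z), p in joint_xz.items():
    p_z[z] += p

H_X_given_Z = entropy(joint_xz.values()) - entropy(p_z.values())   # H(X, Z) - H(Z)
I_XZ = 1.0 - H_X_given_Z                                           # H(X) - H(X|Z), with H(X) = 1
print(H_X_given_Z, I_XZ)                                           # 0.5 and 0.5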
(e) (2 pt) Explain why parts (a) and (d) do not contradict the data-processing inequality.
It doesn’t because (X,Y, Z) doesn’t form a Markov chain.
6. (4 pts) Dr. Seuss has set an exam for his class, where the possible scores are {0, 1, 2, . . . , 100} i.e. the exam is out of 100 marks. Dr. Seuss found that the average exam score of the students is 60 marks out of 100.
(a) (2 pt) What is an upper bound on the fraction of students who scored 80 or more in the exam?
Let X be the student score. Then E[X] = 60. Further, since scores are nonnegative, by Markov's inequality, P(X ≥ 80) ≤ E[X]/80 = 60/80 = 3/4.
(b) (2 pt) Dr. Preibus has set an exam for a different class, where the possible scores are {−100, −99, . . . , 99, 100} i.e. the exam is out of 100 marks, but it is possible to score negative marks. Dr. Preibus found that the average exam score of the students is 60 marks out of 100.
Is your answer from (a) also an upper bound on the fraction of students who scored 80 or more in Dr. Preibus’ exam? Explain why or why not.
No, it isn’t. Markov’s inequality assumes the random variable is nonnegative.
SECTION C.
Answer each of the following questions [ Marks per question as shown; 25% total ]
7. (6 pt) Let X be an ensemble with AX = {a, b} and probabilities pX = (1/4, 3/4).
(a) (1 pt) Write down the alphabet and probabilities for the extended ensemble X^2.
We have AX^2 = {aa, ab, ba, bb} and pX^2 = (1/16, 3/16, 3/16, 9/16).
(b) (3 pt) Compute the essential bit content Hδ(X^2) for δ = 3/8.
Ordering the outcomes of X^2 by increasing probability, the cumulative probabilities are (1/16, 1/4, 7/16, 1). We know that 1/4 = 4/16 < 3/8 = 6/16 < 7/16, so we may discard the two least probable outcomes but not the three least probable. This leaves 2 outcomes, and the essential bit content is Hδ(X^2) = log2 2 = 1.
(c) (2 pt) Suppose δ = 0. Is it true that (1/N) · Hδ(X^N) will be arbitrarily close to the entropy H(X) for N sufficiently large? Explain why or why not.
No. (1/N) · H0(X^N) = (1/N) · log2 |AX^N| = log2 2 = 1, which stays bounded away from H(X) = H2(1/4) ≈ 0.81. This falls outside the purview of the SCT, which requires 0 < δ < 1.
8. (10 pts) Let X be an ensemble with AX = {x1, x2, x3} and probabilities pX = (1/4, 1/2, 1/4).
(a) (3 pt) Compute the modified distribution function F̄(xi) for each outcome xi.
F̄(x1) = (1/2) · (1/4) = 1/8 = 0.001 (binary)
F̄(x2) = 1/4 + (1/2) · (1/2) = 1/2 = 0.10 (binary)
F̄(x3) = 1/4 + 1/2 + (1/2) · (1/4) = 7/8 = 0.111 (binary).
(b) (3 pt) Compute the Shannon-Fano-Elias codeword lengths l(xi) for each outcome xi.
l(x1) = ⌈log2 4⌉ + 1 = 3
l(x2) = ⌈log2 2⌉ + 1 = 2
l(x3) = ⌈log2 4⌉ + 1 = 3.
(c) (4 pt) Construct a Shannon-Fano-Elias code for X.
Taking the first l(xi) bits of the binary expansion of F̄(xi), we have
C(x1) = 001
C(x2) = 10
C(x3) = 111.
9. (4 pts) Let X be an ensemble with AX = {a, b} and fixed probabilities pX = (pa, pb).
(a) (2 pt) Conan has constructed a prefix-free code C1 for X, with expected length L(C1, X). Is it possible that L(C1, X) < H(X)? Explain why or why not.
No, by the source coding theorem for symbol codes we cannot go below the entropy.
(b) (2 pt) Ronan has constructed a Huffman code C2 for X. Is it possible that L(C2, X) > H(X)? Explain why or why not.
Yes. A Huffman code is only guaranteed to satisfy H(X) ≤ L(C2, X) < H(X) + 1; it does not have to attain the entropy exactly. Here, for instance, both codewords must have length 1, so L(C2, X) = 1 > H(X) whenever pa ≠ 1/2.
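The Shannon-Fano-Elias construction from question 8 can be checked with a minimal Python sketch, assuming the standard recipe F̄(xi) = Σ_{j<i} pj + pi/2, l(xi) = ⌈log2(1/pi)⌉ + 1, and codeword = first l(xi) bits of F̄(xi):

import math

p = [("x1", 0.25), ("x2", 0.5), ("x3", 0.25)]

def first_bits(value, num_bits):
    # first num_bits bits of the binary expansion of value in [0, 1)
    bits = ""
    for _ in range(num_bits):
        value *= 2
        bit = int(value)
        value -= bit
        bits += str(bit)
    return bits

cumulative = 0.0
for name, prob in p:
    f_bar = cumulative + prob / 2
    length = math.ceil(math.log2(1 / prob)) + 1
    print(name, f_bar, length, first_bits(f_bar, length))
    cumulative += prob
# Output matches the solution above: x1 -> 001, x2 -> 10, x3 -> 111.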
10. (5 pts) Suppose we wish to compress a string with LZ77 with a window size W = 2. Suppose we place two conditions on the string: it must have length 3 or more, and can only contain the symbols {a, b}. (For example, aba and bbb are valid strings, but ba (which has length 2) and abc (which contains the symbol c) are not.)
Recall that in LZ77, the output at any iteration is either of the type SYM (for an exact symbol match) or POINTER (for a window match).
(a) (2 pt) Give an example of a string satisfying the above conditions where, at every iteration of LZ77 except the first, the output is of the type POINTER.
Consider the string aaa.
(b) (3 pt) Is it possible that for some string satisfying the above conditions, at every iteration of LZ77, the output is of the type SYM? If yes, give an example of one such string. If no, explain why not.
No. Suppose the output at the second iteration is of type SYM. Then the string starts with ab or ba, so after two iterations the window contains both a and b. Whatever the third symbol is, it therefore matches something in the window, so the third output must be a POINTER, not a SYM.
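For illustration, here is a minimal LZ77-style encoder sketch with window size W; the (offset, length) output format is an assumption for illustration and may differ from the convention used in the course, but the SYM/POINTER distinction matches the argument above:

def lz77_encode(s, W=2):
    out, i = [], 0
    while i < len(s):
        best_len, best_off = 0, 0
        window_start = max(0, i - W)
        for off in range(1, i - window_start + 1):    # candidate offsets into the window
            length = 0
            # matches may run into the lookahead, as in standard LZ77
            while i + length < len(s) and s[i + length - off] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, off
        if best_len >= 1:
            out.append(("POINTER", best_off, best_len))
            i += best_len
        else:
            out.append(("SYM", s[i]))
            i += 1
    return out

print(lz77_encode("aaa"))   # [('SYM', 'a'), ('POINTER', 1, 2)]: only the first output is SYM
print(lz77_encode("aba"))   # [('SYM', 'a'), ('SYM', 'b'), ('POINTER', 2, 1)]: third output is a POINTER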
SECTION D.
Answer each of the following questions [ Marks per question as shown; 25% total ]
11. (6 pts) Suppose we wish to construct a code using input alphabet X = {0, 1}.
(a) (1 pt) Give an example of a (3, 2) block code using input alphabet X. We need 22 = 4 codewords of length 3. Pick e.g. {000, 001, 010, 011}.
(b) (1 pt) What is the rate of your code from (a)?
It is K/N = 2/3.
(c) (2 pt) Suppose you use your code from (a) on a channel with capacity 0.5. Is it possible that your code can achieve arbitrarily small probability of maximal block error? Explain why or why not.
No. The NCCT says that we cannot achieve arbitrarily small error probability at any rate above the capacity, and 2/3 > 0.5, so such a rate is not achievable on this particular channel.
(d) (2 pt) Suppose you use your code from (a) on a channel with unknown capacity. You find that the code can achieve arbitrarily small probability of maximal block error on this channel. What, if anything, can you conclude about the capacity of this channel?
The NCCT says that any achievable rate is at most the capacity. So, we know the capacity is at least 2/3.
12. (14 pts) Consider a channel over inputs X = {1, 2} and outputs Y = {1, 2} with transition matrix
Q = [ 3/4  1/2 ]
    [ 1/4  1/2 ].
Let pX = (θ, 1 − θ) be a distribution over the inputs. Let H2(a) denote the entropy of a Bernoulli random variable with parameter a, i.e. H2(a) = −a · log a − (1 − a) · log(1 − a).
(a) (2 pt) Is Q symmetric? Explain why or why not.
No. There is no way to partition the outputs into subsets so that, within each subset, the rows of the corresponding sub-matrix are permutations of each other and the columns are permutations of each other.
(b) (3 pt) Compute p(Y = y) for y ∈ {1, 2}, expressing your answer in terms of θ. We can compute
p(Y = 1) = (3/4) · θ + (1/2) · (1 − θ) = (1/4) · θ + 1/2
p(Y = 2) = (1/4) · θ + (1/2) · (1 − θ) = −(1/4) · θ + 1/2.
(c) (2 pt) Compute H(Y|X), expressing your answer in terms of θ. We have
H(Y|X) = Σ_{x ∈ X} H(Y|X = x) · p(X = x)
= θ · H2(1/4) + (1 − θ) · H2(1/2) = θ · H2(1/4) + (1 − θ).
(d) (3 pt) Compute I(X;Y), expressing your answer in terms of θ. We have
I(X;Y) = H(Y) − H(Y|X)
= H2((1/4)θ + 1/2) − (θ · H2(1/4) + (1 − θ))
= H2((1/4)θ + 1/2) + θ · (1 − H2(1/4)) − 1.
(e) (3 pt) Show that the mutual information is maximised when θ satisfies
(1/2 + (1/4)θ) / (1/2 − (1/4)θ) = 2^(4·(1 − H2(1/4))).
Hint: Recall that H2(x) has derivative H2′(x) = −log2( x / (1 − x) ).
Differentiating,
d/dθ I(X;Y) = −(1/4) · log2( (1/2 + (1/4)θ) / (1/2 − (1/4)θ) ) + (1 − H2(1/4)),
so at optimality
log2( (1/2 + (1/4)θ) / (1/2 − (1/4)θ) ) = 4 · (1 − H2(1/4))
(1/2 + (1/4)θ) / (1/2 − (1/4)θ) = 2^(4·(1 − H2(1/4))).
(f) (1 pt) How do the answers to parts (d) and (e) relate to the channel capacity of Q? The capacity is the mutual information in (d) at the value of θ in (e).
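A numerical sanity check of parts (d)-(f): evaluate I(X;Y) on a grid of θ, take the maximiser, and verify that it approximately satisfies the stationarity condition from part (e):

import math

def H2(a):
    # binary entropy in bits, with 0 * log 0 = 0
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def mutual_information(theta):
    # I(X;Y) = H2(theta/4 + 1/2) + theta * (1 - H2(1/4)) - 1, from part (d)
    return H2(theta / 4 + 0.5) + theta * (1 - H2(0.25)) - 1

thetas = [i / 100000 for i in range(100001)]
theta_star = max(thetas, key=mutual_information)
capacity = mutual_information(theta_star)          # the capacity, per part (f)

lhs = (0.5 + theta_star / 4) / (0.5 - theta_star / 4)
rhs = 2 ** (4 * (1 - H2(0.25)))
print("theta* =", round(theta_star, 4), "I at theta* =", round(capacity, 4))
print("stationarity check:", round(lhs, 4), "vs", round(rhs, 4))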
13. (3 pts) Calculate the value of the three parity bits for the message x = 1111 when it is coded using a (7, 4) Hamming code. You may use a diagram to show your working.
Each parity bit is the XOR (mod-2 sum) of three of the message bits, and all four message bits are 1, so each parity bit equals 1. The resulting codeword is 1111111.
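A minimal sketch of the parity computation, assuming one common (7,4) Hamming convention in which the four data bits are followed by three parity bits; for the all-ones message every parity bit is an XOR of three 1s, so the answer does not depend on the convention:

def hamming74_encode(d1, d2, d3, d4):
    # parity assignments follow one common convention; other texts permute the bits
    p1 = d1 ^ d2 ^ d3
    p2 = d2 ^ d3 ^ d4
    p3 = d1 ^ d3 ^ d4
    return [d1, d2, d3, d4, p1, p2, p3]

print(hamming74_encode(1, 1, 1, 1))   # [1, 1, 1, 1, 1, 1, 1]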
14. (2 pts) Two distinct notions of uncertainty we have looked at are (prefix) Kolmogorov
complexity K, and Shannon entropy H. What are the pros and cons of K versus H?
K is:
• not computable, unlike H;
• a key ingredient in Solomonoff induction, unlike H.
H is:
• dependent on a choice of probability distribution, unlike K;
• a key ingredient in the fundamental limits of coding and compression, unlike K.