COMP2610/6261 – INFORMATION THEORY

SECTION A.
Answer each of the following questions [ Marks per question as shown; 25% total ]
1. (12 points) Suppose X and Y are random variables with the following joint distribution P(X = x, Y = y):

           y = 1   y = 2
   x = 0    0.2     0.1
   x = 1    0.0     0.2
   x = 2    0.3     0.2

Compute each of the following, showing all your working.
(a) (2 points) Compute the marginal distributions of X and Y, i.e. determine the values of P(X = x) for all x, and P(Y = y) for all y.
For x = 0, 1, 2: P(X = x) = 0.3, 0.2, 0.5. For y = 1, 2: P(Y = y) = 0.5, 0.5.
(b) (2 points) Compute the conditional distribution of X given that the value Y = 2, i.e. compute the values of P(X = x|Y = 2) for all x.
For x = 0, 1, 2: P(X = x | Y = 2) = 0.2, 0.4, 0.4.
(c) (2 points) Compute E[X] and E[Y].
E[X] = 0·0.3+1·0.2+2·0.5 = 1.2. Similarly, we have E[Y] = 1·0.5+2·0.5 = 1.5
(d) (4 points) Compute E[XY].
E[XY] = (0 × 1) × 0.2 + (0 × 2) × 0.1 + (1 × 1) × 0.0 + (1 × 2) × 0.2 + (2 × 1) × 0.3 + (2 × 2) × 0.2 = 1.8
(e) (2 points) Are X and Y independent? Explain your answer.
No, they are not independent, as can be seen from the fact that P(X = 1, Y = 1) ≠ P(X = 1)P(Y = 1): the left-hand side is 0 but the right-hand side is 0.2 × 0.5 = 0.1.
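A minimal Python sketch to sanity-check parts (a)–(e), using the joint table above:

# Sanity check for Question 1.
# joint[(x, y)] = P(X = x, Y = y)
joint = {(0, 1): 0.2, (0, 2): 0.1,
         (1, 1): 0.0, (1, 2): 0.2,
         (2, 1): 0.3, (2, 2): 0.2}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# (a) marginal distributions
pX = {x: sum(joint[(x, y)] for y in ys) for x in xs}
pY = {y: sum(joint[(x, y)] for x in xs) for y in ys}
print(pX, pY)                        # ≈ {0: 0.3, 1: 0.2, 2: 0.5} and {1: 0.5, 2: 0.5}

# (b) conditional distribution P(X = x | Y = 2)
print({x: joint[(x, 2)] / pY[2] for x in xs})   # {0: 0.2, 1: 0.4, 2: 0.4}

# (c), (d) expectations
EX = sum(x * p for x, p in pX.items())               # 1.2
EY = sum(y * p for y, p in pY.items())               # 1.5
EXY = sum(x * y * p for (x, y), p in joint.items())  # 1.8
print(EX, EY, EXY)

# (e) independence fails, e.g. at (x, y) = (1, 1)
print(joint[(1, 1)], pX[1] * pY[1])  # 0.0 vs 0.1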
2. (5 points) You have a large bucket which contains 999 fair coins and one biased coin. The biased coin is a two-headed coin (i.e. it always lands heads). Suppose you pick one coin out of the bucket, flip it 10 times, and get all heads.
(a) (1 point) State Bayes’ rule.
P(A|B) = P(B|A)P(A) / P(B), where A and B are events and P(B) ≠ 0.
(b) (1 point) State the Law of Total Probability.
P(X = x) = Σ_j P(X = x, Y = y_j), where the sum ranges over all possible values y_j of Y.
(c) (3 points) What is the probability that the coin you choose is the two-headed coin?
Let B be the event that the chosen coin is the two-headed (biased) coin, and let 10H be the event of observing 10 heads. We want to compute P(B|10H). By Bayes' rule and the Law of Total Probability,
P(B|10H) = P(10H|B)P(B) / P(10H)
= P(10H|B)P(B) / [P(10H|B)P(B) + P(10H|Bᶜ)P(Bᶜ)]
= (1 · 1/1000) / (1 · 1/1000 + (1/2)^10 · 999/1000)
≈ 0.506.
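A minimal Python sketch checking this posterior numerically (999 fair coins, one two-headed coin, 10 observed heads):

# Posterior probability that the chosen coin is the two-headed one.
p_biased = 1 / 1000
p_fair = 999 / 1000
likelihood_biased = 1.0        # P(10 heads | biased)
likelihood_fair = 0.5 ** 10    # P(10 heads | fair)

evidence = likelihood_biased * p_biased + likelihood_fair * p_fair
print(likelihood_biased * p_biased / evidence)   # ≈ 0.506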
3. (5 points)
(a) (1 point) In one or two sentences, explain the difference between the Bayesian and Frequentist approaches to parameter estimation.
Frequentist: parameters are fixed (but unknown); there is no distribution over them. Bayesian: parameters are random, and have a corresponding distribution.
(b) (1 point) In one or two sentences, explain what the maximum likelihood estimate for a parameter is.
The maximum likelihood estimate (MLE) is the parameter value that maximises the likelihood of the observed data:
θˆ(x) = arg max_θ L(θ | x).
(c) (1 point) In one or two sentences, explain what a prior distribution is in the context of Bayesian inference.
The prior is the probability distribution that expresses one's beliefs about the uncertain quantity (the parameter) before any evidence is taken into consideration.
(d) (1 point) In one or two sentences, explain what the Maximum a posteriori (MAP) estimate for a parameter is.
The MAP estimate is the parameter value that maximises the posterior (equivalently, the log-posterior) over the parameter.
(e) (1 point) In one or two sentences, explain the connections between MLE and the MAP estimates. (Hint: what priors should you use?)
If a uniform (flat) prior is used, the MAP estimate coincides with the MLE, since the posterior is then proportional to the likelihood.
4. (3 points) A biased coin is flipped until the first head occurs. Denote by Z the number of flips required, with p being the probability of obtaining a head. Compute P(Z = k), stating clearly the possible values of k.
(Hint: Write down the probabilities of some possible outcomes.)
When Z = 1 the first flip is a head; when Z = 2 the first flip is a tail and the second a head; when Z = 3 there are two tails followed by a head; and so on. Writing q = 1 − p for the probability of a tail,
P(Z = k) = q^(k−1) p
for k = 1, 2, 3, ...
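A minimal Python sketch checking that this is a valid distribution, using an illustrative value p = 0.3 (any 0 < p ≤ 1 would do):

# Geometric distribution P(Z = k) = (1 - p)^(k - 1) * p, k = 1, 2, 3, ...
p = 0.3
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 200)]
print(pmf[:4])    # P(Z = 1..4) = 0.3, 0.21, 0.147, 0.1029
print(sum(pmf))   # ≈ 1, as required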

SECTION B.
Answer each of the following questions [ Marks per question as shown; 25% total ]
5. (4 points) Suppose the random variables X and Y are related by the following joint probabilities P(X = x, Y = y):

           y = 0   y = 1
   x = 0    1/3     1/3
   x = 1     0      1/3

Compute the following:
(a) (2 points) H(X), H(Y).
(b) (2 points) H(X|Y), H(Y|X).
First compute the marginal distributions: P(X) = (2/3, 1/3) and P(Y) = (1/3, 2/3). We now have:
(a) H(X) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918 bits, and similarly
H(Y) = −(1/3) log₂(1/3) − (2/3) log₂(2/3) = 0.918 bits.
(b) We have
H(X|Y) = (1/3) · H(X|Y = 0) + (2/3) · H(X|Y = 1) = (1/3) H(1, 0) + (2/3) H(1/2, 1/2) = 2/3 bits,
and similarly for H(Y|X) we get
H(Y|X) = (2/3) · H(Y|X = 0) + (1/3) · H(Y|X = 1) = (2/3) H(1/2, 1/2) + (1/3) H(0, 1) = 2/3 bits.
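A minimal Python sketch checking these entropy values from the joint table above:

# Entropies for Question 5.
from math import log2

joint = {(0, 0): 1/3, (0, 1): 1/3,
         (1, 0): 0.0, (1, 1): 1/3}

def H(probs):
    """Entropy in bits of a list of probabilities (taking 0 log 0 = 0)."""
    return -sum(q * log2(q) for q in probs if q > 0)

pX = [sum(joint[(x, y)] for y in (0, 1)) for x in (0, 1)]   # (2/3, 1/3)
pY = [sum(joint[(x, y)] for x in (0, 1)) for y in (0, 1)]   # (1/3, 2/3)
print(H(pX), H(pY))   # 0.918, 0.918 bits

# Conditional entropies: H(X|Y) = sum_y P(y) H(X | Y = y), and symmetrically.
HX_given_Y = sum(pY[y] * H([joint[(x, y)] / pY[y] for x in (0, 1)]) for y in (0, 1))
HY_given_X = sum(pX[x] * H([joint[(x, y)] / pX[x] for y in (0, 1)]) for x in (0, 1))
print(HX_given_Y, HY_given_X)   # 2/3, 2/3 bits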
6. (5 points)
(a) (3 points) What is a typical set? How does the typical set relate to the smallest delta-sufficient subset?
The typical set is the set of length-n sequences whose probability is close to 2^(−nH(X)), where H(X) is the entropy of the random variable of interest.
Relationship: it is used in the proof of the Source Coding Theorem. As n → ∞, the smallest δ-sufficient subset S_δ and the typical set increasingly overlap. Hence, we look to encode all typical sequences uniformly, and relate that to the essential bit content, i.e. the log of the size of the smallest δ-sufficient subset.
(b) (2 points) Is the most likely sequence always a member of the typical set? Explain your answer.
The most likely sequence is in general not in the typical set. For example, for X₁, ..., Xₙ i.i.d. with P(0) = 0.1 and P(1) = 0.9, the all-ones sequence (1, 1, ..., 1) is the most likely sequence, but it is not typical because its empirical entropy (log₂(1/0.9) ≈ 0.15 bits per symbol) is not close to the true entropy (≈ 0.47 bits per symbol).
7. (3 points) Construct a Huffman code for the ensemble with alphabet AX = {a, b, c} and probabilities p = (0.6, 0.3, 0.1). Show all your working.
Working: merge the two least probable symbols, c (0.1) and b (0.3), into a node of weight 0.4; then merge this node with a (0.6) to reach weight 1.0. Reading the 0/1 splits back down the tree gives the Huffman code (1, 01, 00) for (a, b, c), with expected length 0.6·1 + 0.3·2 + 0.1·2 = 1.4 bits.
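A minimal Python sketch of the Huffman construction for these probabilities (the 0/1 labels are arbitrary, so a code with the same lengths but swapped bits is equally valid):

# Huffman coding via a min-heap of (probability, tie-breaker, partial code) nodes.
import heapq

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman({"a": 0.6, "b": 0.3, "c": 0.1}))
# {'c': '00', 'b': '01', 'a': '1'} -- lengths (1, 2, 2), matching the answer above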
8. (7 points) Suppose a single fair six-sided die is rolled, and let Y be the outcome.
(a) (3 points) Compute the expected value of Y and the variance of Y .
Using the standard formulas: E[Y] = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5 and
Var(Y) = (1/6)[(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²] = 35/12.
(b) (1 point) Calculate an upper bound for the quantity P(Y ≥ 6) using Markov’s inequality.
By Markov’s inequality, we get:
P(Y ≥ 6) ≤ E[Y]/6 = 3.5/6 ≈ 0.583.
(c) (3 points) Calculate an upper bound for the quantity P(Y ≥ 6) using Chebyshev’s inequality and compare this to the answer obtained in part (b). Which is closer to the true value of P(Y ≥ 6)?
By Chebyshev’s inequality, we get:
P(Y ≥ 6) ≤ P(Y ≥ 6 or Y ≤ 1) = P(|Y − 3.5| ≥ 2.5) ≤ Var(Y)/2.5² = (35/12)/6.25 = 7/15 ≈ 0.467.
Chebyshev's bound is closer: the true value is P(Y ≥ 6) = 1/6 ≈ 0.167, and 0.467 is closer to this than 0.583.
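A minimal Python sketch comparing the two bounds with the exact value:

# Markov vs. Chebyshev bounds on P(Y >= 6) for a fair six-sided die.
outcomes = range(1, 7)
EY = sum(outcomes) / 6                            # 3.5
VarY = sum((y - EY) ** 2 for y in outcomes) / 6   # 35/12 ≈ 2.917

markov = EY / 6                 # ≈ 0.583
chebyshev = VarY / 2.5 ** 2     # 7/15 ≈ 0.467
true_value = 1 / 6              # ≈ 0.167
print(markov, chebyshev, true_value)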
9. (6 points)
(a) (1 point) What is the purpose of the sigmoid function in logistic regression?
It squashes the real-valued score into the range (0, 1), so that the output can be interpreted as a probability P(y = 1 | x).
(b) (2 points) Suppose you have a trained logistic regression model with weights w. Roman proposes to classify a new point x_new as positive by checking if x_newᵀ w > 0. Alice proposes to classify it as positive by checking if σ(x_newᵀ w) > 0.5, where σ(·) is the sigmoid function. Will these result in the same prediction? Explain why or why not.
Yes they will result in the same prediction. This is because the sigmoid is a monotone increasing function, and sigmoid(0) = 0.5. Hence sigmoid(x) > 0.5 iff x > 0.
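A minimal Python sketch illustrating why the two rules agree (σ(z) > 0.5 exactly when z > 0):

from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

for z in (-2.0, -0.1, 0.0, 0.1, 2.0):
    # the last two columns always agree (z = 0 is the shared decision boundary)
    print(z, sigmoid(z), sigmoid(z) > 0.5, z > 0)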
(c) (3 points) In one or two sentences, explain the relationship between logistic regression and the maximum entropy principle.
(1) The maximum entropy principle for estimating probabilities: 'When choosing amongst multiple possible distributions, pick the one with highest entropy.' (2) Given information about a probability distribution, we can find the maximum entropy distribution using Lagrangian optimisation. (3) Logistic regression can be derived from the (conditional) maximum entropy principle.

SECTION C.
Answer each of the following questions [ Marks per question as shown; 25% total ]
10. (10 points) Suppose Y is an ensemble equipped with AY = {a, b, c} and probabilities p = (0.5, 0.25, 0.25).
(a) (2 points) Write down the alphabet and probabilities for the extended ensemble Y².
Writing Z = Y², we have
A_Z = {aa, ab, ba, ac, ca, bc, cb, bb, cc}
p = (1/4, 1/8, 1/8, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16).
(b) (3 points) Assuming the symbols for Y² are in alphabetical order, so that e.g. aa appears before ab, what is the binary interval for ab in a Shannon-Fano-Elias code for Y²?
Alphabetical order means that ab is the second symbol, so the cumulative distribution gives F(aa) = 0.25 and F(ab) = 0.375. The interval for ab is therefore
[0.25, 0.375) in decimal, i.e. [0.01, 0.011)₂ in binary.
(c) (3 points) What is the smallest δ-sufficient subset for Y² when δ = 0.45?
We need P(Y² ∈ S_δ) ≥ 1 − 0.45 = 0.55, so taking outcomes in decreasing order of probability, S_δ = {aa, ab, ba, ac}, with total probability 0.625. (Any three of the probability-1/8 outcomes may be chosen alongside aa.)
(d) (2 points) What is the essential bit content H_δ(Y²) when δ = 0.45?
H_δ(Y²) = log₂ |S_δ| = log₂ 4 = 2 bits.
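A minimal Python sketch checking parts (a)–(d), using A_Y = {a, b, c} with p = (0.5, 0.25, 0.25):

from itertools import product
from math import log2

pY = {"a": 0.5, "b": 0.25, "c": 0.25}

# (a) extended ensemble Y^2
p2 = {x + y: pY[x] * pY[y] for x, y in product(pY, repeat=2)}

# (b) interval for 'ab' under alphabetical ordering
symbols = sorted(p2)   # aa, ab, ac, ba, ...
lo = sum(p2[s] for s in symbols[:symbols.index("ab")])
print(lo, lo + p2["ab"])   # 0.25, 0.375

# (c) smallest delta-sufficient subset: add outcomes in decreasing
# probability until the total probability reaches 1 - delta
delta = 0.45
S, total = [], 0.0
for s in sorted(p2, key=p2.get, reverse=True):
    if total >= 1 - delta:
        break
    S.append(s)
    total += p2[s]
print(S, total)   # ['aa', 'ab', 'ac', 'ba'] (any three 1/8-outcomes work), 0.625

# (d) essential bit content
print(log2(len(S)))   # 2.0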
11. (6 points)
(a) (1 point) Is every prefix-free code uniquely decodable? If yes, explain why. If no, provide a counter-example.
Yes. A prefix code is uniquely decodable: given a complete and accurate encoded sequence, a receiver can identify each codeword without requiring a special 'marker' between the words.
Sketch of an inductive argument: suppose two messages first disagree at the kth symbol. The codewords for those kth symbols differ, and since no codeword can be a prefix of another, the encoded streams must diverge at that point; hence the encodings of the two messages are different.
(b) (1 point) Is the code C = {0, 01, 011} uniquely decodable? Explain your answer.
Yes. Although it is not a prefix code, every codeword starts with 0, so each 0 in the encoded stream marks the start of a codeword; as we move along the message (waiting, if necessary, for the next 0) we uncover whether the first, second or third codeword was sent.
(c) (2 points) Explain the difference between a lossless and uniquely decodable code.
Recall that a code c is lossless if for all x, y ∈ A_X,
x ≠ y ⇒ c(x) ≠ c(y).
This ensures that if we work with a single outcome, we can uniquely decode that outcome. When working with variable-length codes, however, unique decodability concerns sequences: a code c for X is uniquely decodable if no two strings of symbols from A_X have the same encoding. That is, for all sequences x⃗, y⃗ of symbols from A_X,
x⃗ ≠ y⃗ ⇒ c(x⃗) ≠ c(y⃗),
where c is extended to sequences by concatenating codewords. The crux of the matter: losslessness is a condition on a single symbol x, whereas unique decodability is a condition on whole sequences x⃗.
(d) (2 points) Consider a source W = {a, b, c, d, e}. Explain if it is possible to construct a prefix code for this source with the proposed lengths: la = 1, lb = 2, lc = 3, ld = 4, le = 4, without actually giving an example of a code? (Hint: What conditions should you check?)
Check Kraft's inequality: Σᵢ 2^(−lᵢ) = 1/2 + 1/4 + 1/8 + 1/16 + 1/16 = 1. The lengths satisfy Kraft's inequality (with exact equality), so yes, we can construct a prefix code for this source.
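A one-line Python check of the Kraft sum for these lengths:

lengths = [1, 2, 3, 4, 4]
print(sum(2 ** -l for l in lengths))   # 1.0, so a prefix code with these lengths exists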
12. (9 points) Let X be an ensemble with alphabet A_X = {a, b, c, d}, probabilities p = (1/2, 1/4, 1/8, 1/8), and the code C = (0000, 01, 11, 0).
(a) (2 points) What is the entropy H(X) as a single number?
(b) (2 points) What is the expected code length L(C, X)?
(c) Which of these are Shannon codes? Justify your answers.
i. (1 point) A = {0, 10, 110, 111}
ii. (1 point) B = {000, 001, 010, 111}
iii. (1 point) C = {0, 01, 001, 010}
(d) (2 points) Is the code A in part (c)[i] optimal? Explain why or why not.
(a) H(X) = (1/2) log₂ 2 + (1/4) log₂ 4 + (1/8) log₂ 8 + (1/8) log₂ 8 = 1.75 bits.
(b) L(C, X) = (1/2) × 4 + (1/4) × 2 + (1/8) × 2 + (1/8) × 1 = 2.875 = 23/8.
(c) A Shannon code for X has codeword lengths ⌈log₂(1/pᵢ)⌉ = (1, 2, 3, 3):
i. (1 point) A = {0, 10, 110, 111} – yes: the lengths are (1, 2, 3, 3) and A is a prefix code.
ii. (1 point) B = {000, 001, 010, 111} – no: all codewords have length 3, which does not match the required lengths.
iii. (1 point) C = {0, 01, 001, 010} – no: the lengths match, but 0 is a prefix of every other codeword, so this is not a prefix code.
(d) It is optimal because L(A, X) = (1/2) × 1 + (1/4) × 2 + 2 × (1/8) × 3 = 1.75 bits = H(X), and by the Source Coding Theorem no uniquely decodable code can have expected length below H(X).
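A minimal Python sketch checking H(X), the expected lengths of C and of the code A from (c)(i), and the Shannon code lengths:

from math import ceil, log2

p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
C = {"a": "0000", "b": "01", "c": "11", "d": "0"}
A = {"a": "0", "b": "10", "c": "110", "d": "111"}

print(-sum(q * log2(q) for q in p.values()))       # H(X) = 1.75 bits
print(sum(p[s] * len(w) for s, w in C.items()))    # L(C, X) = 2.875
print(sum(p[s] * len(w) for s, w in A.items()))    # L(A, X) = 1.75
print([ceil(log2(1 / q)) for q in p.values()])     # Shannon lengths [1, 2, 3, 3]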

SECTION D.
Answer each of the following questions [ Marks per question as shown; 25% total ]
13. (5 points) Consider a (5, 3) block code C.
(a) (3 points) What is the length of each codeword in class C? How many codewords
does class C define? Compute the rate for class C.
There are 2³ = 8 codewords, each of length 5. The rate is 3/5 = 0.6.
(b) (2 points) Do there exist codes with rate equal to that of class C, that can achieve arbitrarily small probability of block error over a channel of capacity 0.4? Justify your answer.
No, there cannot exist such codes, as the rate would exceed the channel capacity and violate the channel coding theorem.
14. (15 points) Let X = {a, b} and Y = {a, b} be the input and output alphabets for the following two channels:

   R = [ 1     0.5 ]        R† = [ 0.5    0  ]
       [ 0     0.5 ]             [ 0.5    1  ]

(a) (2 points) Describe the behaviour of R and R† when each of the symbols a and b are used as input.
R transmits a with no errors and b with error probability of 0.5. The opposite for R†.
(b) Define an arbitrary distribution pX = (p,q) with p ∈ [0,1] and p + q = 1 over X.
Express the following for R:
i. (2 points) The probabilities P(Y = a) and P(Y = b) in terms of p, where Y is a random variable denoting the output of the channel Y.
P(Y = a) = (1 + p)/2 and P(Y = b) = (1 − p)/2.
ii. (2 points) The entropy H(Y) in terms of the probability p and the function H₂(·) defined by H₂(θ) = −θ · log₂ θ − (1 − θ) · log₂(1 − θ).
H(Y) = H₂((1 + p)/2).
iii. (2 points) The mutual information I(X;Y) in terms of p and the function H₂ defined above.
I(X;Y) = H(Y) − H(Y|X) = H₂((1 + p)/2) − (1 − p).
(c) (4 points) Using the previous results or otherwise, compute the input distribution that achieves the channel capacity for R.
First calculate the derivative of H₂ with respect to θ:
H₂′(θ) = −log₂(θ/(1 − θ)).
From the previous part we have I(X;Y) as a function of p. As I(X;Y) is a concave function of p, to maximise I(X;Y) we solve
0 = dI(X;Y)/dp = (1/2) · H₂′((1 + p)/2) + 1 = −(1/2) · log₂((1 + p)/(1 − p)) + 1,
yielding p = 0.6, i.e. p_X = (0.6, 0.4). This gives capacity C(R) ≈ 0.32 bits.
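A minimal Python sketch confirming the maximising input distribution numerically:

from math import log2

def H2(theta):
    return 0.0 if theta in (0.0, 1.0) else -theta * log2(theta) - (1 - theta) * log2(1 - theta)

def I(p):
    # mutual information for channel R with input distribution (p, 1 - p)
    return H2((1 + p) / 2) - (1 - p)

best_p = max((i / 1000 for i in range(1001)), key=I)
print(best_p, I(best_p))   # 0.6, ≈ 0.322 bits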
(d) (3 points) Suppose you used the channels R and R† to send messages by first flipping a fair coin and sending a symbol through R if it landed heads and through R† if it landed tails. Construct a matrix Q that represents the channel defined by this process.

   Q = (1/2) R + (1/2) R† = [ 0.75   0.25 ]
                            [ 0.25   0.75 ]
15. (5 marks) For an arbitrary noisy channel Q with N input symbols and M output symbols, show that its capacity ρ satisfies
ρ ≤ min{log₂ N, log₂ M}.
For any input distribution we have
I(X;Y) = H(X) − H(X|Y) ≤ H(X) ≤ log₂ |A_X| = log₂ N,
I(X;Y) = H(Y) − H(Y|X) ≤ H(Y) ≤ log₂ |A_Y| = log₂ M,
and since ρ is the maximum of I(X;Y) over input distributions, we have
ρ ≤ min{log₂ N, log₂ M}.
– – – END OF EXAM – – –