COMP2610/COMP6261
Tutorial 5 Sample Solutions
Tutorial 5: Probabilistic inequalities and Mutual Information
Young Lee and Bob Williamson
Tutors: Debashish Chakraborty and Zakaria Mhammedi
Week 5 (21st – 25th August), Semester 2, 2017
1. Consider a discrete variable X taking on values from the set 𝒳. Let p_i be the probability of each state, with i = 1, . . . , |𝒳|. Denote the vector of probabilities by p. We saw in lectures that the entropy of X satisfies

H(X) ≤ log|𝒳|,

with equality if and only if p_i = 1/|𝒳| for all i, i.e. p is uniform. Prove the above statement using Gibbs' inequality, which says

∑_{i=1}^{|𝒳|} p_i log₂(p_i / q_i) ≥ 0

for any probability distributions p, q over |𝒳| outcomes, with equality if and only if p = q.
Solution.
Gibbs' inequality tells us that for any two probability vectors p = (p_1, . . . , p_|𝒳|) and q = (q_1, . . . , q_|𝒳|):

∑_{i=1}^{|𝒳|} p_i log(p_i / q_i) ≥ 0,

with equality if and only if p = q. If we take q to be the vector representing the uniform distribution, q_1 = . . . = q_|𝒳| = 1/|𝒳|, then we get

0 ≤ ∑_{i=1}^{|𝒳|} p_i log(p_i / (1/|𝒳|)) = ∑_{i=1}^{|𝒳|} p_i log p_i + ∑_{i=1}^{|𝒳|} p_i log|𝒳| = −H(p) + log|𝒳|,

with equality if and only if p is the uniform distribution. Moving H(p) to the other side gives the inequality.
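The identity KL(p‖u) = log|𝒳| − H(p) used above can be checked numerically. A minimal Python sketch; the helpers entropy and kl and the example distribution p are illustrative choices, not part of the course material:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in probs if pi > 0)

def kl(p, q):
    """Gibbs' quantity: sum_i p_i log2(p_i / q_i), i.e. KL(p || q)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]  # an arbitrary example distribution
u = [1 / 4] * 4                # uniform over |X| = 4 outcomes

# Rearranging the derivation: KL(p || u) = log|X| - H(p) >= 0, so H(p) <= log|X|
assert abs(kl(p, u) - (math.log2(4) - entropy(p))) < 1e-12
assert entropy(p) <= math.log2(4)
assert abs(entropy(u) - math.log2(4)) < 1e-12  # equality iff p is uniform
```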
2. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:

H(X, g(X)) = H(X) + H(g(X) | X)     (a)
           = H(X);                   (b)
H(X, g(X)) = H(g(X)) + H(X | g(X))  (c)
           ≥ H(g(X)).                (d)

Thus H(g(X)) ≤ H(X).
Solution.
(a) This uses the chain rule of entropy, H(X, Y) = H(X) + H(Y | X), with Y = g(X).
(b) Given X, we can determine g(X), since it is fixed by being a function of X. This means no uncertainty remains about g(X) when X is given. Thus H(g(X) | X) = ∑_x p(x) H(g(X) | X = x) = 0, since each term H(g(X) | X = x) is the entropy of a point mass, which is zero.
(c) This also uses the chain rule of entropy, H(X, Y) = H(Y) + H(X | Y), with Y = g(X).
(d) Here H(X | g(X)) ≥ 0, since the conditional entropy of a discrete random variable is non-negative; hence H(X, g(X)) ≥ H(g(X)). Equality holds exactly when g is one-to-one on the support of X, in which case H(X | g(X)) = 0.
Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)).
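As a numeric sanity check, the inequality can be verified on a small example; the distribution p_x and the (non-injective) function g below are arbitrary illustrative choices:

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in probs if v > 0)

def g(x):
    """An arbitrary non-injective function of X."""
    return x % 2

p_x = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}  # an example distribution for X

# push-forward distribution of g(X)
p_g = defaultdict(float)
for x, px in p_x.items():
    p_g[g(x)] += px

H_X = entropy(p_x.values())
H_gX = entropy(p_g.values())
assert H_gX <= H_X  # H(g(X)) <= H(X)
assert H_gX < H_X   # strict here, because this g is not one-to-one
```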
3. Random variables X , Y , Z are said to form a Markov chain in that order (denoted by X → Y → Z) if their
joint probability distribution can be written as:
p(X, Y, Z) = p(X) · p(Y | X) · p(Z | Y)
(a) Suppose (X,Y, Z) forms a Markov chain. Is it possible for I(X;Y ) = I(X;Z)? If yes, give an example of
X,Y, Z where this happens. If no, explain why not.
(b) Suppose (X,Y, Z) does not form a Markov chain. Is it possible for I(X;Y ) ≥ I(X;Z)? If yes, give an
example of X,Y, Z where this happens. If no, explain why not.
Solution.
(a) Yes; pick Z = Y.
Reason: The data processing inequality guarantees I(X;Y) ≥ I(X;Z). Here we want to verify that equality is possible. Looking at the proof of the data processing inequality, we just need to find a Z with I(X;Y | Z) = 0.
For Z = Y: intuitively, once we condition on Z, knowing Y reduces the uncertainty in X by nothing further, because Z already tells us everything that Y can. Formally, I(X;Y) = I(X;Z) because the pairs (X, Y) and (X, Z) have the same joint distribution. Note: to formally check that Z is conditionally independent of X given Y, we can check p(Z = z, X = x | Y = y) = p(Z = z | Y = y) · p(X = x | Y = y) for all possible x, y, z. The reason is that the left and right hand sides are both zero when y ≠ z; and when y = z, they both equal p(X = x | Y = y), since p(Z = z | X = x, Y = y) = 1 in this case.
(b) Yes; pick X, Z independent, and let Y = X + Z (assuming the outcomes are numeric).
Reason: Z is not conditionally independent of X given Y; intuitively, knowing X + Z and X tells us what Z is. So (X, Y, Z) does not form a Markov chain. However, since X and Z are independent, I(X;Z) = 0. Since mutual information is non-negative, I(X;Y) ≥ 0 = I(X;Z).
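This example can be verified by direct computation over the joint distributions. A Python sketch, assuming (as an illustrative special case) that X and Z are independent fair bits:

```python
import math
from itertools import product
from collections import defaultdict

def mi(joint):
    """Mutual information I(A;B) in bits from a joint pmf {(a, b): prob}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# X, Z independent fair bits; Y = X + Z
p_xy = defaultdict(float)
p_xz = defaultdict(float)
for x, z in product([0, 1], repeat=2):
    p_xy[(x, x + z)] += 0.25
    p_xz[(x, z)] += 0.25

assert abs(mi(p_xz)) < 1e-12  # I(X;Z) = 0 by independence
assert mi(p_xy) > 0           # Y = X + Z carries information about X
```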
4. If X → Y → Z, then show that
(a) I(X;Z) ≤ I(X;Y )
(b) I(X;Y |Z) ≤ I(X;Y )
Solution. See the proof of the data processing inequality given in Lecture 9.
5. A coin is known to land heads with probability 1/5. The coin is flipped N times for some even integer N.
(a) Using Markov's inequality, provide a bound on the probability of observing N/2 or more heads.
(b) Using Chebyshev's inequality, provide a bound on the probability of observing N/2 or more heads. Express your answer in terms of N.
(c) For N ∈ {2, 4, . . . , 20}, in a single plot, show the bounds from parts (a) and (b), as well as the exact probability of observing N/2 or more heads.
Solution.
Let X_1, . . . , X_N represent the N flips as independent Bernoulli random variables, where X_i = 1 represents observing a head on flip i and X_i = 0 a tail. Let X̂_N = (1/N) ∑_{i=1}^N X_i. Then the probability of observing N/2 or more heads can be expressed as p(X̂_N ≥ 1/2), and p(X_i = 1) = 1/5 for each i.
(a) Using Markov's inequality with threshold 1/2,

p(X̂_N ≥ 1/2) ≤ E[X̂_N] / (1/2) = ((1/N) ∑_{i=1}^N E[X_i]) / (1/2) = (1/5) / (1/2) = 2/5

∴ p(X̂_N ≥ 1/2) ≤ 2/5.
(b) We need the variance of a Bernoulli random variable: Var(X) = p(1 − p), so

∴ Var[X_i] = (1/5)(1 − 1/5) = 4/25.

Using the definition of variance and its properties (the X_i are independent),

Var(X̂_N) = Var[(1/N) ∑_{i=1}^N X_i] = (∑_{i=1}^N Var[X_i]) / N² = N(4/25) / N² = 4/(25N).

Using Chebyshev's inequality,

p(|X̂_N − E[X̂_N]| ≥ λ) ≤ Var(X̂_N) / λ².

Since E[X̂_N] = 1/5, the event X̂_N ≥ 1/2 implies |X̂_N − 1/5| ≥ 3/10, so taking λ = 3/10:

p(|X̂_N − 1/5| ≥ 3/10) ≤ (4/(25N)) / (3/10)²

∴ p(X̂_N ≥ 1/2) ≤ 16/(9N).
(c) Let X = ∑_{i=1}^N X_i denote the total number of heads. The exact probability of k heads is given by the binomial distribution:

P(X = k) = C(N, k) · (1/5)^k · (4/5)^(N−k),

where C(N, k) = N! / (k!(N − k)!) is the binomial coefficient.
So, the probability of seeing N/2 or more heads is

P(X ≥ N/2) = ∑_{k=N/2}^{N} P(X = k) = ∑_{k=N/2}^{N} C(N, k) · (1/5)^k · (4/5)^(N−k).

Another way to calculate the exact probability:

p(X̂_N ≥ 1/2) = 1 − p(X̂_N < 1/2).

This can be done in Matlab using 1-binocdf(floor(0.5.*n-0.5), n, 0.2). Here floor(0.5.*n-0.5) simply brings the value of n down to the largest integer less than n/2 for each value of n. For example, n=10 gives floor(0.5.*n-0.5) = 4, which is what we want.
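For readers without Matlab, an equivalent stdlib-only Python sketch (using math.comb for the binomial coefficients) that checks both bounds dominate the exact tail probability across the requested range of N:

```python
import math

def exact_tail(N, p=1/5):
    """P(N/2 or more heads) in N flips of a coin with head probability p."""
    return sum(math.comb(N, k) * p**k * (1 - p)**(N - k)
               for k in range(N // 2, N + 1))

for N in range(2, 21, 2):
    markov = 2 / 5            # bound from part (a)
    chebyshev = 16 / (9 * N)  # bound from part (b)
    tail = exact_tail(N)
    assert tail <= markov and tail <= chebyshev
```

For N = 2, for example, exact_tail gives 1 − (4/5)² = 0.36, below the Markov bound of 0.4.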
The code for the plot is included below:

n = 2:2:20;

% Markov inequality
y_m = 2/5;

% Chebyshev inequality
y_c = 16 ./ (9 .* n);

% Exact probabilities
y_e = 1 - binocdf(floor(0.5.*n - 0.5), n, 0.2);

plot(n, y_c, 'k')
hold on;
plot([2 20], [y_m y_m], 'm-')
plot(n, y_e, 'b')
set(gca, 'fontsize', 14)

title('Markov''s bound, Chebyshev''s bound and exact probability of observing N/2 or more heads')
ylabel('Probability')
xlabel('Number of coin flips (N)')
legend('Chebyshev', 'Markov', 'Exact');