COMP2610 / COMP6261 – Information Theory
Lecture 9: Probabilistic Inequalities
Robert C. Williamson
Research School of Computer Science
20 August, 2018
Last time
Mutual information chain rule
Jensen’s inequality
“Information cannot hurt”
Data processing inequality
Review: Data-Processing Inequality
Theorem
If X → Y → Z, then I(X; Y) ≥ I(X; Z)
X is the state of the world, Y is the data gathered, and Z is the processed data
No "clever" manipulation of the data can improve the inferences that can be made from the data
No processing of Y, deterministic or random, can increase the information that Y contains about X
This time
Markov’s inequality
Chebyshev’s inequality
Law of large numbers
Outline
1 Properties of expectation and variance
2 Markov’s inequality
3 Chebyshev’s inequality
4 Law of large numbers
5 Wrapping Up
Expectation and Variance
Let X be a random variable over X, with probability distribution p
Expected value:
E[X] = ∑_{x∈X} x · p(x)
Variance:
V[X] = E[(X − E[X])²] = E[X²] − (E[X])²
Standard deviation: √V[X]
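As a quick numerical sanity check, here is a minimal sketch (an assumed example, not from the slides) computing the expected value and variance of a fair six-sided die directly from the definitions above:

```python
import numpy as np

# Assumed example: E[X] and V[X] for a fair die,
# computed directly from the definitions above.
outcomes = np.arange(1, 7)      # X = {1, ..., 6}
p = np.full(6, 1 / 6)           # uniform p(x)

mean = np.sum(outcomes * p)               # E[X] = sum_x x p(x)
var = np.sum(outcomes**2 * p) - mean**2   # V[X] = E[X^2] - (E[X])^2

print(mean, var)  # 3.5 2.9166...
```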
Properties of expectation
A key property of expectations is linearity:
E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i]
LHS = ∑_{x1∈X1} · · · ∑_{xn∈Xn} p(x1, . . . , xn) · (∑_{i=1}^n x_i)
This holds even if the variables are dependent!
We have, for any a ∈ R,
E[aX] = a · E[X]
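To see that linearity really does not need independence, here is a hedged check (an assumed example, not from the slides) where the second variable is a deterministic copy of the first:

```python
import numpy as np

# Assumed example: linearity of expectation under maximal dependence.
rng = np.random.default_rng(0)
x1 = rng.integers(1, 7, size=1_000_000)  # die rolls
x2 = x1                                  # fully dependent on x1

# E[X1 + X2] still equals E[X1] + E[X2], despite the dependence.
print(np.mean(x1 + x2), np.mean(x1) + np.mean(x2))
```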
Properties of variance
We have linearity of variance for independent random variables:
V[∑_{i=1}^n X_i] = ∑_{i=1}^n V[X_i]
Does not hold in general if the variables are dependent
(prove this: expand the definition of variance and rely upon E[X_i X_j] = E[X_i] E[X_j] when X_i ⊥ X_j)
We have, for any a ∈ R,
V[aX] = a² · V[X]
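A hedged numerical check (assumed example, not from the slides): variances add for independent variables, but dependence breaks additivity:

```python
import numpy as np

# Assumed example: variance is additive only under independence.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)  # independent of x

print(np.var(x + y), np.var(x) + np.var(y))  # both close to 2
print(np.var(x + x), 2 * np.var(x))          # about 4 vs 2: fails under dependence
```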
Markov’s Inequality
Motivation
1000 school students sit an examination
The busy principal is only told that the average score is 40 (out of 100).
The principal wants to estimate the maximum possible number of students
who scored more than 80
A question about the minimum number of students is trivial to answer. Why?
Markov's Inequality
Motivation
Let x be the number of students who score > 80
Let S be the total score of students who score ≤ 80
We know:
40 · 1000 − S = {total score of students who score above 80} > 80x
Exam scores are nonnegative, so certainly S ≥ 0
Thus, 80x < 40 · 1000, or x < 500. Can we formalise this more generally?

Markov's Inequality
Theorem
Let X be a nonnegative random variable. Then, for any λ > 0,
p(X ≥ λ) ≤ E[X] / λ
Bounds the probability of observing a large outcome
Vacuous if λ < E[X]

Markov's Inequality
Alternate Statement
Corollary
Let X be a nonnegative random variable. Then, for any λ > 0,
p(X ≥ λ · E[X]) ≤ 1 / λ
Observations of a nonnegative random variable are unlikely to be much larger than the expected value
Vacuous if λ < 1

Markov's Inequality
Proof
E[X] = ∑_{x∈X} x · p(x)
     = ∑_{x<λ} x · p(x) + ∑_{x≥λ} x · p(x)
     ≥ ∑_{x≥λ} x · p(x)        (nonnegativity of the random variable)
     ≥ ∑_{x≥λ} λ · p(x)
     = λ · p(X ≥ λ)

Markov's Inequality
Illustration
[Figures from https://justindomke.wordpress.com/2008/06/19/markovs-inequality/]

Chebyshev's Inequality
Motivation
Markov's inequality only uses the mean of the distribution
What about the spread of the distribution (variance)?
[Figure illustrating the spread of a distribution]

Chebyshev's Inequality
Theorem
Let X be a random variable with E[X] < ∞. Then, for any λ > 0,
p(|X − E[X]| ≥ λ) ≤ V[X] / λ²
Bounds the probability of observing an "unexpected" outcome
Does not require nonnegativity
Two-sided bound
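As a quick empirical check of Markov's inequality (an assumed example, not from the slides), the sketch below compares the bound E[X]/λ with the true tail of an exponential distribution, which is nonnegative with a known mean:

```python
import numpy as np

# Assumed example: compare Markov's bound E[X]/lam with the true tail
# p(X >= lam) for a nonnegative distribution (exponential, E[X] = 2).
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)

for lam in (1.0, 2.0, 4.0, 8.0):
    empirical = np.mean(x >= lam)
    bound = x.mean() / lam
    print(f"lam={lam}: p(X >= lam) = {empirical:.4f} <= bound {bound:.4f}")
```

The bound is loose for small λ (even vacuous when λ < E[X]), which is why the variance-aware Chebyshev bound below is often preferable.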
Chebyshev's Inequality
Alternate Statement
Corollary
Let X be a random variable with E[X] < ∞. Then, for any λ > 0,
p(|X − E[X]| ≥ λ · √V[X]) ≤ 1 / λ²
Observations are unlikely to occur several standard deviations away from the mean
Chebyshev's Inequality
Proof
Define Y = (X − E[X])².
Then, by Markov's inequality, for any ν > 0,
p(Y ≥ ν) ≤ E[Y] / ν
But, E[Y] = V[X].
Also, Y ≥ ν ⟺ |X − E[X]| ≥ √ν.
Thus, setting λ = √ν,
p(|X − E[X]| ≥ λ) ≤ V[X] / λ²
Chebyshev's Inequality
Illustration
For a binomial X with N trials and success probability θ, we have e.g.
p(|X − Nθ| ≥ √(2Nθ(1 − θ))) ≤ 1/2
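The binomial bound above is easy to check by simulation; this is a hedged sketch with assumed values N = 100 and θ = 0.3 (any values would do):

```python
import numpy as np

# Assumed example: check p(|X - N*theta| >= sqrt(2*N*theta*(1-theta))) <= 1/2
# by simulation, for one arbitrary choice of N and theta.
rng = np.random.default_rng(0)
N, theta = 100, 0.3
x = rng.binomial(N, theta, size=1_000_000)

lam = np.sqrt(2 * N * theta * (1 - theta))  # lambda = sqrt(2 V[X])
empirical = np.mean(np.abs(x - N * theta) >= lam)
print(empirical, "<= 0.5")  # Chebyshev guarantees at most 1/2
```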
Chebyshev's Inequality
Example
Suppose we have a coin with bias θ, i.e. p(X = 1) = θ
Say we flip the coin n times, and observe x1, . . . , xn ∈ {0, 1}
We use the maximum likelihood estimator of θ:
θ̂n = (x1 + . . . + xn) / n
Estimate how large n should be such that
p(|θ̂n − θ| ≥ 0.05) ≤ 0.01
i.e. at most 1% probability of a 5% error
(Aside: the need for two parameters here is generic: "Probably Approximately Correct")
Chebyshev's Inequality
Example
Observe that
E[θ̂n] = (∑_{i=1}^n E[x_i]) / n = θ
V[θ̂n] = (∑_{i=1}^n V[x_i]) / n² = θ(1 − θ) / n
Thus, applying Chebyshev's inequality to θ̂n,
p(|θ̂n − θ| > 0.05) ≤ θ(1 − θ) / ((0.05)² · n)
We are guaranteed this is less than 0.01 if
n ≥ θ(1 − θ) / ((0.05)² · (0.01))
When θ = 0.5, n ≥ 10,000 (!)
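Rearranging the bound gives the sample size directly; a one-line sketch (the function name chebyshev_n is my own, hypothetical):

```python
# Hypothetical helper: sample size guaranteeing
# p(|theta_hat - theta| >= eps) <= delta via Chebyshev's inequality.
def chebyshev_n(theta: float, eps: float = 0.05, delta: float = 0.01) -> float:
    return theta * (1 - theta) / (eps**2 * delta)

print(chebyshev_n(0.5))  # 10000.0 flips needed for a fair coin
```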
Independent and Identically Distributed
Let X1, . . . ,Xn be random variables such that:
Each Xi is independent of Xj, for j ≠ i
The distribution of Xi is the same as that of Xj
Then, we say that X1, . . . ,Xn are independent and identically distributed
(or iid)
Example: For n independent flips of an unbiased coin, X1, . . . , Xn are iid from Bernoulli(1/2)
Law of Large Numbers
Theorem
Let X1, . . . , Xn be a sequence of iid random variables, with E[Xi] = µ and V[Xi] = σ² < ∞. Define
X̄n = (X1 + . . . + Xn) / n
Then, for any ε > 0,
lim_{n→∞} p(|X̄n − µ| > ε) = 0
Given enough trials, the empirical "success frequency" will be close to the expected value
Law of Large Numbers
Proof
Since the Xi's are identically distributed,
E[X̄n] = µ
Since the Xi's are independent,
V[X̄n] = V[(X1 + . . . + Xn) / n] = V[X1 + . . . + Xn] / n² = nσ² / n² = σ² / n
Law of Large Numbers
Proof (continued)
Applying Chebyshev's inequality to X̄n,
p(|X̄n − µ| ≥ ε) ≤ V[X̄n] / ε² = σ² / (nε²)
For any fixed ε > 0, as n → ∞, the right-hand side → 0.
Thus,
p(|X̄n − µ| < ε) → 1
Law of Large Numbers
Illustration
N = 1000 trials with a Bernoulli random variable with parameter 1/2
[Figure: sample mean X̄n over the 1000 trials, settling near 0.5]

Law of Large Numbers
Illustration
N = 50000 trials with a Bernoulli random variable with parameter 1/2
[Figure: sample mean X̄n over the 50000 trials, settling near 0.5]
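A small simulation (assumed code, not the source of the original figures) reproduces illustrations like the ones above:

```python
import numpy as np

# Assumed example: running sample mean of Bernoulli(1/2) flips.
rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=50_000)  # iid Bernoulli(1/2)
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

# The deviation from mu = 0.5 shrinks as n grows, as the LLN predicts.
for n in (10, 100, 1_000, 50_000):
    print(n, running_mean[n - 1])
```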
Summary & Conclusions
Markov’s inequality
Chebyshev’s inequality
Law of large numbers
Next time
Ensembles and sequences
Typical sets
Asymptotic Equipartition Property (AEP)