COMP2610 / COMP6261 – Information Theory
Lecture 9: Probabilistic Inequalities

Robert C. Williamson

Research School of Computer Science

20 August, 2018


Last time

Mutual information chain rule

Jensen’s inequality

“Information cannot hurt”

Data processing inequality


Review: Data-Processing Inequality

Theorem
If X → Y → Z then I(X; Y) ≥ I(X; Z)

X is the state of the world, Y is the data gathered and Z is the
processed data

No “clever” manipulation of the data can improve the inferences that
can be made from the data

No processing of Y , deterministic or random, can increase the
information that Y contains about X

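As a quick numerical illustration, here is a minimal sketch in Python (numpy assumed; the binary chain and all channel probabilities below are hypothetical, chosen only for this example) that computes I(X; Y) and I(X; Z) for a toy chain X → Y → Z and confirms I(X; Y) ≥ I(X; Z):

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint distribution matrix pxy[i, j] = p(x=i, y=j)."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = pxy > 0                        # avoid log(0) on zero-probability cells
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

px = np.array([0.5, 0.5])                  # p(x)
py_x = np.array([[0.9, 0.1], [0.2, 0.8]])  # p(y|x): a noisy channel
pz_y = np.array([[0.7, 0.3], [0.3, 0.7]])  # p(z|y): further (random) processing

pxy = px[:, None] * py_x   # joint p(x, y)
pxz = pxy @ pz_y           # joint p(x, z), valid since Z ⊥ X given Y

print(mutual_information(pxy), mutual_information(pxz))  # first ≥ second
```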

This time

Markov’s inequality

Chebyshev’s inequality

Law of large numbers


Outline

1 Properties of expectation and variance

2 Markov’s inequality

3 Chebyshev’s inequality

4 Law of large numbers

5 Wrapping Up


Expectation and Variance

Let X be a random variable over X, with probability distribution p.

Expected value:

    E[X] = ∑_{x∈X} x · p(x).

Variance:

    V[X] = E[(X − E[X])²]
         = E[X²] − (E[X])².

Standard deviation is √V[X].

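A minimal sketch of these definitions in Python (numpy assumed; the four-outcome distribution below is hypothetical):

```python
import numpy as np

# Hypothetical discrete distribution over outcomes {0, 1, 2, 3}.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])

mean = np.sum(x * p)                       # E[X] = sum_x x * p(x) = 2.0
var = np.sum((x - mean) ** 2 * p)          # V[X] = E[(X - E[X])^2] = 1.0
var_alt = np.sum(x ** 2 * p) - mean ** 2   # E[X^2] - (E[X])^2, same value
std = np.sqrt(var)                         # standard deviation

print(mean, var, var_alt, std)
```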

Properties of expectation

A key property of expectations is linearity:

    E[∑_{i=1}^n Xi] = ∑_{i=1}^n E[Xi]

LHS = ∑_{x1∈X1} ··· ∑_{xn∈Xn} p(x1, ..., xn) · (∑_{i=1}^n xi)

(swapping the order of summation and marginalising out the other variables
yields ∑_{i=1}^n E[Xi] = RHS)

This holds even if the variables are dependent!

We have for any a ∈ R,

    E[aX] = a · E[X].

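A small sketch confirming that linearity survives dependence (numpy assumed; the deliberately dependent variables below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dependent variables: Y = 2X + noise, so X and Y are strongly correlated.
x = rng.integers(0, 10, size=100_000)          # uniform on {0, ..., 9}, E[X] = 4.5
y = 2 * x + rng.integers(0, 3, size=100_000)   # E[Y] = 2 * 4.5 + 1 = 10

# Linearity holds despite the dependence: E[X + Y] = E[X] + E[Y] = 14.5.
print(np.mean(x + y))   # ≈ 14.5
```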

Properties of variance

We have additivity of variance for independent random variables:

    V[∑_{i=1}^n Xi] = ∑_{i=1}^n V[Xi].

Does not hold if the variables are dependent

(Prove this: expand the definition of variance and rely upon E[Xi Xj] = E[Xi] · E[Xj]
when Xi ⊥ Xj)

We have for any a ∈ R,

    V[aX] = a² · V[X].

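A sketch contrasting the independent and dependent cases (numpy assumed; the distributions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # standard normal, V[X] = 1
y = rng.normal(size=100_000)   # independent of x

# Independent: V[X + Y] ≈ V[X] + V[Y] ≈ 2.
print(np.var(x + y), np.var(x) + np.var(y))

# Dependent (take Y = X): V[X + X] = V[2X] = 4 V[X], not 2 V[X].
print(np.var(x + x), 2 * np.var(x))   # ≈ 4 vs ≈ 2
```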


Markov’s Inequality
Motivation

1000 school students sit an examination

The busy principal is only told that the average score is 40 (out of 100).

The principal wants to estimate the maximum possible number of students
who scored more than 80

A question about the minimum number of students is trivial to answer. Why?



Call x the number of students who score > 80

Call S the total score of students who score ≤ 80

We know:

40 · 1000 − S = {total score of students who score above 80} > 80x

Exam scores are nonnegative, so certainly S ≥ 0

Thus, 80x < 40 · 1000, so x < 500. Can we formalise this more generally?

Markov's Inequality

Theorem
Let X be a nonnegative random variable. Then, for any λ > 0,

    p(X ≥ λ) ≤ E[X] / λ.

Bounds probability of observing a large outcome

Vacuous if λ < E[X]

Markov's Inequality
Alternate Statement

Corollary
Let X be a nonnegative random variable. Then, for any λ > 0,

    p(X ≥ λ · E[X]) ≤ 1/λ.

Observations of nonnegative random variable unlikely to be much larger
than expected value

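Both statements are easy to check empirically. A minimal sketch (numpy assumed; the unit-mean exponential is one hypothetical choice of nonnegative random variable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonnegative random variable: exponential with E[X] = 1.
x = rng.exponential(scale=1.0, size=1_000_000)

for lam in [1, 2, 5, 10]:
    empirical = np.mean(x >= lam)   # p(X >= λ) estimated from samples
    bound = np.mean(x) / lam        # Markov bound E[X] / λ
    print(lam, empirical, bound)    # empirical ≤ bound in each case
```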
Vacuous if λ < 1

Markov's Inequality
Proof

    E[X] = ∑_{x∈X} x · p(x)
         = ∑_{x<λ} x · p(x) + ∑_{x≥λ} x · p(x)
         ≥ ∑_{x≥λ} x · p(x)      (nonnegativity of the random variable)
         ≥ ∑_{x≥λ} λ · p(x)
         = λ · p(X ≥ λ).

Markov's Inequality
Illustration

[Figures from https://justindomke.wordpress.com/2008/06/19/markovs-inequality/]

Chebyshev's Inequality
Motivation

Markov's inequality only uses the mean of the distribution

What about the spread of the distribution (variance)?

[Figure: distributions with the same mean but different spreads]

Chebyshev's Inequality

Theorem
Let X be a random variable with E[X] < ∞. Then, for any λ > 0,

    p(|X − E[X]| ≥ λ) ≤ V[X] / λ².

Bounds the probability of observing an “unexpected” outcome

Does not require nonnegativity

Two-sided bound


Chebyshev’s Inequality
Alternate Statement

Corollary
Let X be a random variable with E[X] < ∞. Then, for any λ > 0,

    p(|X − E[X]| ≥ λ · √V[X]) ≤ 1/λ².

Observations are unlikely to occur several standard deviations away from
the mean

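A quick empirical check of the corollary (numpy assumed; the standard normal is a hypothetical choice, for which the bound is quite loose):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

mu, sigma = np.mean(x), np.std(x)
for lam in [1, 2, 3]:
    empirical = np.mean(np.abs(x - mu) >= lam * sigma)  # p(|X - E[X]| >= λ·σ)
    print(lam, empirical, 1 / lam ** 2)                 # empirical ≤ 1/λ²
```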

Chebyshev’s Inequality
Proof

Define

    Y = (X − E[X])².

Then, by Markov's inequality, for any ν > 0,

    p(Y ≥ ν) ≤ E[Y] / ν.

But E[Y] = V[X]. Also,

    Y ≥ ν ⇐⇒ |X − E[X]| ≥ √ν.

Thus, setting λ = √ν,

    p(|X − E[X]| ≥ λ) ≤ V[X] / λ².


Chebyshev’s Inequality
Illustration

For a binomial X with N trials and success probability θ, we have e.g.

    p(|X − Nθ| ≥ √(2Nθ(1−θ))) ≤ 1/2.

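A sketch checking this bound by simulation (numpy assumed; the values of N and θ below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
N, theta = 100, 0.3

x = rng.binomial(N, theta, size=1_000_000)       # samples of binomial X
radius = np.sqrt(2 * N * theta * (1 - theta))    # sqrt(2)·σ, where σ² = Nθ(1−θ)

print(np.mean(np.abs(x - N * theta) >= radius))  # at most 1/2 by Chebyshev
```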

Chebyshev’s Inequality
Example

Suppose we have a coin with bias θ, i.e. p(X = 1) = θ

Say we flip the coin n times, and observe x1, . . . , xn ∈ {0, 1}

We use the maximum likelihood estimator of θ:

    θ̂n = (x1 + ··· + xn) / n

Estimate how large n should be such that

p(|θ̂n − θ| ≥ 0.05) ≤ 0.01?

1% probability of a 5% error

(Aside: the need for two parameters here is generic: "Probably Approximately
Correct")



Observe that

    E[θ̂n] = (∑_{i=1}^n E[xi]) / n = θ

    V[θ̂n] = (∑_{i=1}^n V[xi]) / n² = θ(1−θ) / n.

Thus, applying Chebyshev's inequality to θ̂n,

    p(|θ̂n − θ| > 0.05) ≤ θ(1−θ) / ((0.05)² · n).

We are guaranteed this is less than 0.01 if

    n ≥ θ(1−θ) / ((0.05)² · (0.01)).

When θ = 0.5, n ≥ 10,000 (!)
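A sketch of this calculation, plus an empirical check that the Chebyshev bound is conservative (numpy assumed; the repetition count is a hypothetical choice):

```python
import numpy as np

# Chebyshev-based sample size for p(|θ̂n − θ| ≥ eps) ≤ delta,
# using the worst case θ(1 − θ) ≤ 1/4.
eps, delta = 0.05, 0.01
n = int(np.ceil(0.25 / (eps ** 2 * delta)))
print(n)  # 10000

# Empirical check at θ = 0.5: how often is the estimate off by ≥ 0.05?
rng = np.random.default_rng(0)
theta, reps = 0.5, 1_000
flips = rng.random((reps, n)) < theta             # reps repetitions of n coin flips
theta_hat = flips.mean(axis=1)                    # ML estimate from each repetition
print(np.mean(np.abs(theta_hat - theta) >= eps))  # far below 0.01: the bound is loose
```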


Independent and Identically Distributed

Let X1, ..., Xn be random variables such that:

Each Xi is independent of Xj, for i ≠ j

The distribution of Xi is the same as that of Xj

Then, we say that X1, . . . ,Xn are independent and identically distributed
(or iid)

Example: For n independent flips of an unbiased coin, X1, ..., Xn are iid
from Bernoulli(1/2)


Law of Large Numbers

Theorem
Let X1, ..., Xn be a sequence of iid random variables, with E[Xi] = µ and
V[Xi] = σ² < ∞. Define

    X̄n = (X1 + ··· + Xn) / n.

Then, for any ε > 0,

    lim_{n→∞} p(|X̄n − µ| > ε) = 0.

Given enough trials, the empirical “success frequency” will be close to the
expected value

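Before the proof, a simulation sketch of the theorem (numpy assumed; Bernoulli(1/2), as in the illustrations after the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Running average of iid Bernoulli(1/2) variables: X̄n should approach µ = 1/2.
n = 50_000
flips = rng.integers(0, 2, size=n)
running_avg = np.cumsum(flips) / np.arange(1, n + 1)

for k in [10, 100, 1_000, 50_000]:
    print(k, running_avg[k - 1])  # drifts toward 0.5 as k grows
```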

Law of Large Numbers
Proof

Since the Xi's are identically distributed,

    E[X̄n] = µ.

Since the Xi's are independent,

    V[X̄n] = V[(X1 + ··· + Xn) / n]
          = V[X1 + ··· + Xn] / n²
          = nσ² / n²
          = σ² / n.



Applying Chebyshev's inequality to X̄n,

    p(|X̄n − µ| ≥ ε) ≤ V[X̄n] / ε² = σ² / (nε²).

For any fixed ε > 0, as n → ∞, the right hand side → 0.

Thus,

    p(|X̄n − µ| < ε) → 1.

Law of Large Numbers
Illustration

[Figure: N = 1000 trials with a Bernoulli(1/2) random variable]

[Figure: N = 50000 trials with a Bernoulli(1/2) random variable]

Summary & Conclusions

Markov's inequality

Chebyshev's inequality

Law of large numbers

Next time

Ensembles and sequences

Typical sets

Asymptotic Equipartition Property (AEP)