COMP2610 / COMP6261 – Information Theory – Lecture 10: Typicality and Asymptotic Equipartition Property
COMP2610 / COMP6261 – Information Theory
Lecture 10: Typicality and Asymptotic Equipartition Property
Robert C. Williamson
Research School of Computer Science
1 L O G O U S E G U I D E L I N E S T H E A U S T R A L I A N N A T I O N A L U N I V E R S I T Y
ANU Logo Use Guidelines
Deep Gold
C30 M50 Y70 K40
PMS Metallic 8620
PMS 463
Black
C0 M0 Y0 K100
PMS Process Black
Preferred logo Black version
Reverse version
Any application of the ANU logo on a coloured
background is subject to approval by the Marketing
Office, contact
brand@anu.edu.au
The ANU logo is a contemporary
reflection of our heritage.
It clearly presents our name,
our shield and our motto:
First to learn the nature of things.
To preserve the authenticity of our brand identity, there are
rules that govern how our logo is used.
Preferred logo – horizontal logo
The preferred logo should be used on a white background.
This version includes black text with the crest in Deep Gold in
either PMS or CMYK.
Black
Where colour printing is not available, the black logo can
be used on a white background.
Reverse
The logo can be used white reversed out of a black
background, or occasionally a neutral dark background.
21 August, 2018
1 / 31
Last time
Markov’s inequality
Chebyshev’s inequality
Law of large numbers
2 / 31
Law of Large Numbers
Theorem
Let X1, . . . ,Xn be a sequence of iid random variables, with
E[Xi ] = µ
and V[Xi ] <∞. Define X̄n = X1 + . . .+ Xn n . Then, for any β > 0,
lim
n→∞
p(|X̄n − µ| < β) = 1. This is also called X̄n → µ in probability. Definition: For random variables v1, v2, . . ., we say vn → v in probability if for all β > 0 limn→∞ P(|vn − v | > β) = 0.
β is fixed (not shrinking like 1n ). Not max/min. Reduction in variability.
3 / 31
This time
Ensembles and sequences
Typical sets
Asymptotic Equipartition Property (AEP)
4 / 31
1 Ensembles and sequences
Counting Types of Sequences
2 Typical sets
3 Asymptotic Equipartition Property (AEP)
4 Wrapping Up
5 / 31
Ensembles
Ensemble
An ensemble X is a triple (x ,AX ,PX ); x is a random variable taking
values in AX = {a1, a2, . . . , aI} with probabilities PX = {p1, p2, . . . , pI}.
We will call AX the alphabet of the ensemble
6 / 31
Ensembles
Example: Bent Coin
Let X be an ensemble with outcomes h for
heads with probability 0.9 and t for tails
with probability 0.1.
The outcome set is AX = {h, t}
The probabilities are
PX = {ph = 0.9, pt = 0.1}
7 / 31
Extended Ensembles
We can also consider blocks of outcomes, which will be useful to describe
sequences:
Example (Coin Flips):
hhhhthhththh→ hh hh th ht ht hh (6 × 2 outcome blocks)
→ hhh hth hth thh (4 × 3 outcome blocks)
→ hhhh thht hthh (3 × 4 outcome blocks)
Extended Ensemble
Let X be a single ensemble. The extended ensemble of blocks of size N
is denoted X N . Outcomes from X N are denoted x = (x1, x2, . . . , xN). The
probability of x is defined to be P(x) = P(x1)P(x2) . . .P(xN).
8 / 31
Extended Ensembles
We can also consider blocks of outcomes, which will be useful to describe
sequences:
Example (Coin Flips):
hhhhthhththh→ hh hh th ht ht hh (6 × 2 outcome blocks)
→ hhh hth hth thh (4 × 3 outcome blocks)
→ hhhh thht hthh (3 × 4 outcome blocks)
Extended Ensemble
Let X be a single ensemble. The extended ensemble of blocks of size N
is denoted X N . Outcomes from X N are denoted x = (x1, x2, . . . , xN). The
probability of x is defined to be P(x) = P(x1)P(x2) . . .P(xN).
8 / 31
Extended Ensembles
Example: Bent Coin
Let X be an ensemble with outcomes
AX = {h, t} with ph = 0.9 and pt = 0.1.
Consider X 4 – i.e., 4 flips of the coin.
AX 4 = {hhhh, hhht, hhth, . . . , tttt}
P(hhhh) = (0.9)4 ≈ 0.6561
P(tttt) = (0.1)4 = 0.0001
P(hthh) = 0.9 · 0.1 · 0.9 · 0.9 = (0.9)3(0.1) ≈ 0.0729
P(htht) = 0.9 · 0.1 · 0.9 · 0.1 = (0.9)2(0.1)2 ≈ 0.0081.
9 / 31
Extended Ensembles
Example: Bent Coin
Let X be an ensemble with outcomes
AX = {h, t} with ph = 0.9 and pt = 0.1.
Consider X 4 – i.e., 4 flips of the coin.
AX 4 = {hhhh, hhht, hhth, . . . , tttt}
P(hhhh) = (0.9)4 ≈ 0.6561
P(tttt) = (0.1)4 = 0.0001
P(hthh) = 0.9 · 0.1 · 0.9 · 0.9 = (0.9)3(0.1) ≈ 0.0729
P(htht) = 0.9 · 0.1 · 0.9 · 0.1 = (0.9)2(0.1)2 ≈ 0.0081.
9 / 31
Extended Ensembles
Example: Bent Coin
Entropy of extended ensembles
We can view X 4 as comprising 4 independent random variables, based on
the ensemble X
Entropy is additive for independent random variables
Thus,
H(X 4) = 4H(X ) = 4. (−0.9 log2 0.9− 0.1 log2 0.1) = 1.88bits.
More generally,
H(X N) = NH(X ).
10 / 31
Counting Types of Sequences
Criteria for dividing 2N sequences into types
In the bent coin example,
(0.9)2(0.1)2 = P(hhtt)
= P(htht)
= P(htth)
= P(thht)
= P(thth)
= P(tthh).
The order of outcomes in the sequence is irrelevant
11 / 31
Counting Types of Sequences
Let X be an ensemble with alphabet AX = {a1, . . . , aI}.
Let p(X = ai) = pi .
For a sequence x = x1, x2, . . . , xN , how to compute p(x)?
let ni = # of times symbol ai appears in x (symbol count)
Given the ni ’s, we can compute the probability of seeing x:
P(x) = P(x1) · P(x2) · . . . · P(xN)
= P(a1)
n1 · P(a2)n2 · . . . · P(aI)nI
= pn11 · p
n2
2 . . . p
nI
I
Sufficient statistics: {n1, n2, . . . , nI}. Use it as a criteria of partitioning.
12 / 31
Counting Types of Sequences
Sequence Types
Each unique choice of (n1, n2, . . . , nI) gives a different type of sequence
4 heads, (3 heads, 1 tail), (2 heads, 2 tails), …
Sequences in each type are equiprobable.
For a given type of sequence how many sequences are there with these
symbol counts?
# of sequences with ni copies of ai =
N!
n1!n2! . . . nI!
(
N
n1
)(
N − n1
n2
)(
N − n1 − n2
n3
)
. . .
=
N!
n1!(N − n1)!
· (N − n1)!
n2!(N − n1 − n2)!
· (N − n1 − n2)!
n3!(N − n1 − n2 − n3)!
. . .
13 / 31
Counting Types of Sequences
Example
Probability of types
Let A = {a, b, c} with P(a) = 0.2, P(b) = 0.3, P(c) = 0.5.
Each sequence of type (na, nb, nc) = (2, 1, 3) has length 6 and probability
(0.2)2(0.3)1(0.5)3 = 0.0015.
There are 6!2!1!3! = 60 such sequences.
The probability x is of type (2, 1, 3) is (0.0015) · 60 = 0.09.
Study probabilities at the level of types (most likely, average/typical)
14 / 31
Counting Types of Sequences
Example
Probability of types
Let A = {a, b, c} with P(a) = 0.2, P(b) = 0.3, P(c) = 0.5.
Each sequence of type (na, nb, nc) = (2, 1, 3) has length 6 and probability
(0.2)2(0.3)1(0.5)3 = 0.0015.
There are 6!2!1!3! = 60 such sequences.
The probability x is of type (2, 1, 3) is (0.0015) · 60 = 0.09.
Study probabilities at the level of types (most likely, average/typical)
14 / 31
Counting Types of Sequences
Example
Probability of types
Let A = {a, b, c} with P(a) = 0.2, P(b) = 0.3, P(c) = 0.5.
Each sequence of type (na, nb, nc) = (2, 1, 3) has length 6 and probability
(0.2)2(0.3)1(0.5)3 = 0.0015.
There are 6!2!1!3! = 60 such sequences.
The probability x is of type (2, 1, 3) is (0.0015) · 60 = 0.09.
Study probabilities at the level of types (most likely, average/typical)
14 / 31
Counting Types of Sequences
Example
Probability of types
Let A = {a, b, c} with P(a) = 0.2, P(b) = 0.3, P(c) = 0.5.
Each sequence of type (na, nb, nc) = (2, 1, 3) has length 6 and probability
(0.2)2(0.3)1(0.5)3 = 0.0015.
There are 6!2!1!3! = 60 such sequences.
The probability x is of type (2, 1, 3) is (0.0015) · 60 = 0.09.
Study probabilities at the level of types (most likely, average/typical)
14 / 31
1 Ensembles and sequences
Counting Types of Sequences
2 Typical sets
3 Asymptotic Equipartition Property (AEP)
4 Wrapping Up
15 / 31
Extended Ensembles
Example
With ph = 0.75, what are the probabilities for X N?
N = 2
x P(x)
hh 0.5625
ht 0.1875
th 0.1875
tt 0.0625
N = 3
x P(x)
hhh 0.4219
hht 0.1406
hth 0.1406
thh 0.1406
htt 0.0469
tht 0.0469
tth 0.0469
ttt 0.0156
N = 4
x P(x) x P(x)
hhhh 0.3164 thht 0.0352
hhht 0.1055 thth 0.0352
hhth 0.1055 tthh 0.0352
hthh 0.1055 httt 0.0117
thhh 0.1055 thtt 0.0117
htht 0.0352 ttht 0.0117
htth 0.0352 ttth 0.0117
hhtt 0.0352 tttt 0.0039
16 / 31
Extended Ensembles
Example
With ph = 0.75, what are the probabilities for X N?
N = 2
x P(x)
hh 0.5625
ht 0.1875
th 0.1875
tt 0.0625
N = 3
x P(x)
hhh 0.4219
hht 0.1406
hth 0.1406
thh 0.1406
htt 0.0469
tht 0.0469
tth 0.0469
ttt 0.0156
N = 4
x P(x) x P(x)
hhhh 0.3164 thht 0.0352
hhht 0.1055 thth 0.0352
hhth 0.1055 tthh 0.0352
hthh 0.1055 httt 0.0117
thhh 0.1055 thtt 0.0117
htht 0.0352 ttht 0.0117
htth 0.0352 ttth 0.0117
hhtt 0.0352 tttt 0.0039
16 / 31
Extended Ensembles
Example
With ph = 0.75, what are the probabilities for X N?
N = 2
x P(x)
hh 0.5625
ht 0.1875
th 0.1875
tt 0.0625
N = 3
x P(x)
hhh 0.4219
hht 0.1406
hth 0.1406
thh 0.1406
htt 0.0469
tht 0.0469
tth 0.0469
ttt 0.0156
N = 4
x P(x) x P(x)
hhhh 0.3164 thht 0.0352
hhht 0.1055 thth 0.0352
hhth 0.1055 tthh 0.0352
hthh 0.1055 httt 0.0117
thhh 0.1055 thtt 0.0117
htht 0.0352 ttht 0.0117
htth 0.0352 ttth 0.0117
hhtt 0.0352 tttt 0.0039
16 / 31
Observations
As N increases, there is an increasing spread of probabilities
The most likely single sequence will always be the all h’s
However, for N = 4, the most likely sequence type is 3 h’s and 1 t
Not surprising because 3 = N · ph, pretty much average case.
17 / 31
Symbol Frequency in Long Sequences
To judge if a sequence is typical/average, a natural question to ask is:
How often does each symbol appear in a sequence x from X N?
Intuitively, in a sequence of length N, let ai appear for ni times.
Then in expectation
ni ≈ N · P(ai)
Note pi = P(ai), and
P(x) = P(a1)
n1P(a2)
n2 . . .P(aI)
nI ≈ pNp11 p
Np2
2 . . . p
NpI
I
So the information content − log2 P(x) of that sequence is approximately
−p1N log2 p1 − . . .− pIN log2 pI = −N
I∑
i=1
pi log2 pi = NH(X )
18 / 31
Symbol Frequency in Long Sequences
To judge if a sequence is typical/average, a natural question to ask is:
How often does each symbol appear in a sequence x from X N?
Intuitively, in a sequence of length N, let ai appear for ni times.
Then in expectation
ni ≈ N · P(ai)
Note pi = P(ai), and
P(x) = P(a1)
n1P(a2)
n2 . . .P(aI)
nI ≈ pNp11 p
Np2
2 . . . p
NpI
I
So the information content − log2 P(x) of that sequence is approximately
−p1N log2 p1 − . . .− pIN log2 pI = −N
I∑
i=1
pi log2 pi = NH(X )
18 / 31
Typical Sets
We want to consider elements x that have − log2 P(x) “close” to NH(X )
Typical Set
For “closeness” β > 0 the typical set TNβ for X N is
TNβ
def
= {x : |− log2 P(x)− NH(X )| < Nβ}
=
{
x :
∣∣∣∣− 1N log2 P(x)− H(X )
∣∣∣∣ < β
}
Union of types
19 / 31
Typical Sets
We want to consider elements x that have − log2 P(x) “close” to NH(X )
Typical Set
For “closeness” β > 0 the typical set TNβ for X N is
TNβ
def
= {x : |− log2 P(x)− NH(X )| < Nβ}
=
{
x :
∣∣∣∣− 1N log2 P(x)− H(X )
∣∣∣∣ < β
}
Union of types
✦
❧�❣✷ P✭①✮
✁◆❍✭❳✮
❚✂✄
What when β = 0 (and replace < by ≤)?
Criterion based on information content. Other criterion (KL divergence)?
19 / 31
Typical Sets
The name “typical” is used since x ∈ TNβ will have roughly p1N
occurrences of symbol a1, p2N of a2, . . ., pK N of aK .
x log2(P (x))
...1...................1.....1....1.1.......1........1...........1.....................1.......11... !50.1
......................1.....1.....1.......1....1.........1.....................................1.... !37.3
........1....1..1...1....11..1.1.........11.........................1...1.1..1...1................1. !65.9
1.1...1................1.......................11.1..1............................1.....1..1.11..... !56.4
...11...........1...1.....1.1......1..........1....1...1.....1............1......................... !53.2
..............1......1.........1.1.......1..........1............1...1......................1....... !43.7
.....1........1.......1...1............1............1...........1......1..11........................ !46.8
.....1..1..1...............111...................1...............1.........1.1...1...1.............1 !56.4
.........1..........1.....1......1..........1....1..............................................1... !37.3
......1........................1..............1.....1..1.1.1..1...................................1. !43.7
1.......................1..........1...1...................1....1....1........1..11..1.1...1........ !56.4
...........11.1.........1................1......1.....................1............................. !37.3
.1..........1...1.1.............1.......11...........1.1...1..............1.............11.......... !56.4
......1...1..1.....1..11.1.1.1...1.....................1............1.............1..1.............. !59.5
............11.1......1....1..1............................1.......1..............1.......1......... !46.8
.................................................................................................... !15.2
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 !332.1
Randomly drawn sequences for P(1) = 0.1. Note: H(X ) ≈ 0.47
20 / 31
Typical Sets
Properties
Typical sequences are nearly equiprobable: Every x ∈ TNβ has
2−N(H(X)+β) ≤ P(x) ≤ 2−N(H(X)−β).
Variation is small when β is small
Number of sequences in the typical set: For any N, β,
|TNβ| ≤ 2N(H(X)+β).
21 / 31
Typical Sets
Proof of Cardinality Bound
For every x ∈ TNβ ,
p(x) ≥ 2−N(H(X)−β).
Thus,
1 =
∑
x
p(x)
≥
∑
x∈TNβ
p(x)
≥
∑
x∈TNβ
2−N(H(X)−β)
= 2−N(H(X)−β) · |TNβ|.
Thus
|TNβ| ≤ 2N(H(X)+β)
22 / 31
Typical Sets
Most Likely Sequence
The most likely sequence may not belong to the typical set
e.g. with ph = 0.75, we have
−1
4
log2 P(hhhh) = 0.4150
whereas H(X ) = 0.8113
The most likely single sequence→ hhhh
The most likely single sequence type→ {hhht, hthh, . . .}
23 / 31
Typical Sets
Most Likely Sequence
Probability of most likely sequence decays like (ph)N (ph = 0.75)
Sequences with N · ph heads contain much more total probability mass
0 50 100 150 200 250 300
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
Block Size (N)
T
o
ta
l
P
ro
b
a
b
il
it
y
Type (3N/4, N/4) sequences
Type (N, 0) sequences
Blue curve corresponds to typical set with β = 0. What if β > 0?
24 / 31
1 Ensembles and sequences
Counting Types of Sequences
2 Typical sets
3 Asymptotic Equipartition Property (AEP)
4 Wrapping Up
25 / 31
Asymptotic︸ ︷︷ ︸
Eventually
Equipartition︸ ︷︷ ︸
Equally Divided
Property
Informally
Asymptotic Equipartition Property (Informal)
As N →∞, log2 P(x1, . . . , xN) is close to −NH(X ) with high probability.
For large block sizes “almost all sequences are typical” (i.e., in TNβ)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
4.4: Typicality 79
Figure 4.11. Anatomy of the typical set T . For p1 = 0.1 and N = 100 and N = 1000, these graphs
show n(r), the number of strings containing r 1s; the probability P (x) of a single string
that contains r 1s; the same probability on a log scale; and the total probability n(r)P (x) of
all strings that contain r 1s. The number r is on the horizontal axis. The plot of log2 P (x)
also shows by a dotted line the mean value of log2 P (x) = !NH2(p1), which equals !46.9
when N = 100 and !469 when N = 1000. The typical set includes only the strings that
have log2 P (x) close to this value. The range marked T shows the set TN! (as defined in
section 4.4) for N = 100 and ! = 0.29 (left) and N = 1000, ! = 0.09 (right).
N = 100 N = 1000
n(r) =
!
N
r
”
0
2e+28
4e+28
6e+28
8e+28
1e+29
1.2e+29
0 10 20 30 40 50 60 70 80 90 100
0
5e+298
1e+299
1.5e+299
2e+299
2.5e+299
3e+299
0 100 200 300 400 500 600 700 800 9001000
P (x) = pr1(1! p1)N!r
0
1e-05
2e-05
0 10 20 30 40 50 60 70 80 90 100
0
1e-05
2e-05
0 1 2 3 4 5
log2 P (x)
-350
-300
-250
-200
-150
-100
-50
0
0 10 20 30 40 50 60 70 80 90 100
T
-3500
-3000
-2500
-2000
-1500
-1000
-500
0
0 100 200 300 400 500 600 700 800 9001000
T
n(r)P (x) =
!
N
r
”
pr1(1! p1)N!r
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0 10 20 30 40 50 60 70 80 90 100
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
0 100 200 300 400 500 600 700 800 9001000
r r
Probability sequence x has r heads for N = 100 (left) and N = 1000 (right). Here
P(X = head) = 0.1.
26 / 31
Asymptotic Equipartition Property
Formally
Asymptotic Equipartition Property
If x1, x2, . . . are i.i.d. with distribution P then, in probability,
− 1
N
log2 P(x1, . . . , xN)→ H(X ).
In precise language:
(∀β > 0) lim
N→∞
p
(∣∣∣∣− 1N log2 P(x1, . . . , xN)− H(X )
∣∣∣∣ < β ) = 1. Exactly the probability of x ∈ TNβ . Recall definition: for random variables v1, v2, . . ., we say vN → v in probability if for all β > 0 limN→∞ P(|vN − v | > β) = 0
Here vN corresponds to − 1N log2 P(x1, . . . , xN).
27 / 31
Asymptotic Equipartition Property
Formally
Asymptotic Equipartition Property
If x1, x2, . . . are i.i.d. with distribution P then, in probability,
− 1
N
log2 P(x1, . . . , xN)→ H(X ).
In precise language:
(∀β > 0) lim
N→∞
p
(∣∣∣∣− 1N log2 P(x1, . . . , xN)− H(X )
∣∣∣∣ < β ) = 1. Exactly the probability of x ∈ TNβ . Recall definition: for random variables v1, v2, . . ., we say vN → v in probability if for all β > 0 limN→∞ P(|vN − v | > β) = 0
Here vN corresponds to − 1N log2 P(x1, . . . , xN).
27 / 31
Asymptotic Equipartition Property
Comments
Why is it surprising/significant?
For an ensemble with binary outcomes, and low entropy,
|TNβ| ≤ 2NH(X)+β � 2N
i.e. the typical set is a small fraction of all possible sequences
AEP says that for N sufficiently large, we are virtually guaranteed to draw
a sequence from this small set
Significance in information theory
28 / 31
Asymptotic Equipartition Property
Proof
Since x1, . . . , xN are independent,
− 1
N
log p(x1, . . . , xN) = −
1
n
log
N∏
n=1
p(xn)
= − 1
N
N∑
n=1
log p(xn).
Let Y = − log p(X ) and yn = − log p(xn). Then, yn ∼ Y , and
E[Y ] = H(X ).
But then by the law of large numbers,
(∀β > 0) lim
N→∞
p
(∣∣∣∣∣ 1N
N∑
n=1
yn − H(X )
∣∣∣∣∣ > β
)
= 0.
29 / 31
1 Ensembles and sequences
Counting Types of Sequences
2 Typical sets
3 Asymptotic Equipartition Property (AEP)
4 Wrapping Up
30 / 31
Summary & Conclusions
Ensembles and sequences
Typical sets
Asymptotic Equipartition Property (AEP)
Next: Source Coding.
31 / 31
Ensembles and sequences
Counting Types of Sequences
Typical sets
Asymptotic Equipartition Property (AEP)
Wrapping Up