COMP2610 – Information Theory
Lecture 12: The Source Coding Theorem
Robert C. Williamson
Research School of Computer Science
28 August 2018
1 / 27
Last time
Basic goal of compression
Key concepts: codes and their types, raw bit content, essential bit content
Informal statement of source coding theorem
2 / 27
A General Communication Game (Recap)
Data compression is the process of replacing a message with a smaller
message which can be reliably converted back to the original.
Want small messages on average when outcomes are from a fixed,
known, but uncertain source (e.g., coin flips with known bias)
[Diagram: a Sender encodes the outcome sequence "Heads, Tails, Heads, …" into the message "010", which the Receiver decodes back to "Heads, Tails, Heads, …"]
3 / 27
Definitions (Recap)

Source Code
Given an ensemble X, the function c : A_X → B is a source code for X.
The number of symbols in c(x) is the length l(x) of the codeword for x.
The extension of c is defined by c(x1 . . . xn) = c(x1) . . . c(xn)

Smallest δ-sufficient subset
Let X be an ensemble and for δ ≥ 0 define Sδ to be the smallest subset of
A_X such that P(x ∈ Sδ) ≥ 1 − δ

Essential Bit Content
Let X be an ensemble. Then for δ ≥ 0 the essential bit content of X is
    Hδ(X) := log2 |Sδ|

4 / 27
Essential Bit Content (Recap)

Intuitively, construct Sδ by removing outcomes of A_X in ascending order of
probability, until the removed mass reaches the δ threshold

x   P(x)
a   1/4
b   1/4
c   1/4
d   3/16
e   1/64
f   1/64
g   1/64
h   1/64

Outcomes are ranked (high to low) by P(x = ai); the least probable are
removed to leave a set Sδ with P(x ∈ Sδ) ≥ 1 − δ:

δ = 0    : Sδ = {a, b, c, d, e, f, g, h}
δ = 1/64 : Sδ = {a, b, c, d, e, f, g}
δ = 1/16 : Sδ = {a, b, c, d}
δ = 3/4  : Sδ = {a}

5 / 27
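As a concrete supplement (my own sketch, not part of the original slides), the greedy construction above can be written in a few lines of Python; with ties among equal-probability outcomes broken so that later letters are dropped first, it reproduces the four Sδ sets listed on this slide.

```python
from math import log2

# Probabilities of the eight-outcome ensemble from the slide.
p = {'a': 1/4, 'b': 1/4, 'c': 1/4, 'd': 3/16,
     'e': 1/64, 'f': 1/64, 'g': 1/64, 'h': 1/64}

def smallest_delta_sufficient(p, delta):
    """Drop the least probable outcomes while the dropped mass stays <= delta."""
    ranked = sorted(p, key=lambda x: (-p[x], x))   # most probable first, ties alphabetical
    dropped, keep = 0.0, set(p)
    for x in reversed(ranked):                     # drop from the least probable end
        if dropped + p[x] <= delta:
            dropped += p[x]
            keep.discard(x)
        else:
            break
    return keep

for delta in [0, 1/64, 1/16, 3/4]:
    S = smallest_delta_sufficient(p, delta)
    print(f"delta={delta}: S_delta={sorted(S)}, H_delta=log2({len(S)})={log2(len(S)):.2f} bits")
```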
Lossy Coding (Recap)
Consider a coin with P(Heads) = 0.9
If we are happy to fail on up to 2% of sequences, we can ignore any
sequence of 10 outcomes with more than 3 tails
There are only 176 < 2^8 = 256 sequences with 3 or fewer tails
So, we can just code those, and ignore the rest!
Coding 10 outcomes with at most 2% failure is therefore doable with 8 bits,
i.e. 0.8 bits/outcome
6 / 27
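A quick check of these numbers (my own illustration, not from the slides): count the 10-flip sequences with at most 3 tails, and the probability mass of the sequences we choose to fail on.

```python
from math import comb

p_tail = 0.1          # bent coin from the slide: P(Heads) = 0.9
N = 10

# Number of length-10 sequences with at most 3 tails.
n_kept = sum(comb(N, k) for k in range(4))
print(n_kept)         # 176, which fits in 8 bits since 176 < 2**8 = 256

# Probability that a sequence has more than 3 tails (the ones we fail on).
p_fail = sum(comb(N, k) * p_tail**k * (1 - p_tail)**(N - k) for k in range(4, N + 1))
print(round(p_fail, 4))   # about 0.0128, i.e. under the 2% failure budget
```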
This time
Recap: typical sets
Formal statement of source coding theorem
Proof of source coding theorem
7 / 27
The Source Coding Theorem
(Theorem 4.1 in MacKay)

Our aim this week is to understand this:

The Source Coding Theorem
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and
0 < δ < 1, there exists a positive integer N0 such that for all N > N0

    | (1/N) Hδ(X^N) − H | < ε.

In English:
Given outcomes drawn from X . . .
. . . no matter what reliability 1 − δ and tolerance ε you choose . . .
. . . there is always a length N0 so that blocks of N > N0 outcomes . . .
. . . have an average essential bit content (1/N) Hδ(X^N) within ε of H(X)

Hδ(X^N) measures the fewest bits needed to uniformly code the smallest
set Sδ of N-outcome sequences with P(x ∈ Sδ) ≥ 1 − δ.

8 / 27
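To see the theorem in action numerically, here is a small Python sketch (my own illustration, not from the slides; δ = 0.05 is an arbitrary choice): it computes (1/N) Hδ(X^N) exactly for the bent coin with P(Heads) = 0.9 by grouping sequences according to their number of tails, and the values approach H(X) ≈ 0.47 as N grows.

```python
from math import comb, log2, ceil

def essential_bits_per_outcome(N, p_head=0.9, delta=0.05):
    """(1/N) * H_delta(X^N) for i.i.d. flips of a bent coin.

    All sequences with k tails share probability p_head**(N-k) * (1-p_head)**k,
    so S_delta is built by adding whole 'k-tail' groups of C(N, k) sequences
    in order of decreasing per-sequence probability, plus part of the last group.
    """
    p_tail = 1 - p_head
    target = 1 - delta
    covered, count = 0.0, 0
    for k in range(N + 1):                        # k tails: most probable group first
        p_seq = p_head ** (N - k) * p_tail ** k   # probability of one such sequence
        group = comb(N, k)
        if covered + group * p_seq < target:
            covered += group * p_seq
            count += group
        else:
            # only part of this group is needed to reach probability 1 - delta
            count += ceil((target - covered) / p_seq)
            break
    return log2(count) / N

for N in [10, 100, 1000]:
    print(N, round(essential_bits_per_outcome(N), 3))
# Single-flip entropy H(X) for p_head = 0.9 is about 0.469 bits.
```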
1 Introduction
  Quick Review
2 Extended Ensembles
  Definition and Properties
  Essential Bit Content
  The Asymptotic Equipartition Property
3 The Source Coding Theorem
  Typical Sets
  Statement of the Theorem

9 / 27

Extended Ensembles (Review)

Instead of coding single outcomes, we now consider coding blocks and
sequences of blocks

Example (Coin Flips): hhhhthhththh
  → hh hh th ht ht hh   (6 × 2-outcome blocks)
  → hhh hth hth thh     (4 × 3-outcome blocks)
  → hhhh thht hthh      (3 × 4-outcome blocks)

Extended Ensemble
The extended ensemble of blocks of size N is denoted X^N. Outcomes from
X^N are denoted x = (x1, x2, . . . , xN). The probability of x is defined to be
P(x) = P(x1) P(x2) . . . P(xN).

What is the entropy of X^N?

10 / 27

Extended Ensembles (Review)
Example: Bent Coin

Let X be an ensemble with outcomes A_X = {h, t} with ph = 0.9 and pt = 0.1.
Consider X^4 – i.e., 4 flips of the coin. A_{X^4} = {hhhh, hhht, hhth, . . . , tttt}

What is the probability of
  Four heads? P(hhhh) = (0.9)^4 ≈ 0.656
  Four tails?  P(tttt) = (0.1)^4 = 0.0001

What is the entropy and raw bit content of X^4?
  The outcome set size is |A_{X^4}| = |{0000, 0001, 0010, . . . , 1111}| = 16
  Raw bit content: H0(X^4) = log2 |A_{X^4}| = 4
  Entropy: H(X^4) = 4 H(X) = 4 · (−0.9 log2 0.9 − 0.1 log2 0.1) ≈ 1.88

11 / 27

Essential Bit Content of Extended Ensembles

What if we use a lossy uniform code on the extended ensemble?

x     P(x)      x     P(x)
hhhh  0.656     thht  0.008
hhht  0.073     thth  0.008
hhth  0.073     tthh  0.008
hthh  0.073     httt  0.001
thhh  0.073     thtt  0.001
htht  0.008     ttht  0.001
htth  0.008     ttth  0.001
hhtt  0.008     tttt  0.000

δ = 0       gives Hδ(X^4) = log2 16 = 4
δ = 0.0001  gives Hδ(X^4) = log2 15 = 3.91
δ = 0.005   gives Hδ(X^4) = log2 11 = 3.46
δ = 0.05    gives Hδ(X^4) = log2 5  = 2.32
δ = 0.25    gives Hδ(X^4) = log2 3  = 1.6

Unlike entropy, Hδ(X^4) ≠ 4 Hδ(X) = 0 (here for δ = 0.25)

12 / 27

Essential Bit Content of Extended Ensembles

What happens as N increases?

[Figure: (1/N) Hδ(X^N) plotted against δ for N = 10, 210, 410, 610, 810, 1010;
the curves flatten out as N grows.]

Recall that the entropy of a single coin flip with ph = 0.9 is H(X) ≈ 0.47

13 / 27

Essential Bit Content of Extended Ensembles
Some Intuition

Why does the curve flatten for large N?
  Recall that for N = 1000, e.g., sequences with 900 heads are considered typical
  Such sequences occupy most of the probability mass, and are roughly equally likely
  As we increase δ, we will quickly encounter these sequences, and make small,
  roughly equal-sized changes to |Sδ|

14 / 27

Typical Sets and the AEP (Review)

[Figure: distribution of −log2 P(x) over sequences x, concentrating around NH(X);
the typical set TNβ covers the region near NH(X).]

15 / 27

Typical Sets and the AEP (Review)

Typical Set
For “closeness” β > 0 the typical set TNβ for X^N is
    TNβ := { x : | −(1/N) log2 P(x) − H(X) | < β }
The name “typical” is used since x ∈ TNβ will have roughly p1·N
occurrences of symbol a1, p2·N of a2, . . ., pK·N of aK.
Asymptotic Equipartition Property (Informal)
As N →∞, log2 P(x1, . . . , xN) is close to −NH(X ) with high probability.
For large block sizes “almost all sequences are typical” (i.e., in TNβ).
16 / 27
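A small illustration of the definition (mine, not from the slides; the particular β, N and tail counts are arbitrary choices): for the bent coin with P(Heads) = 0.9 and N = 1000, compute −(1/N) log2 P(x) for sequences with various numbers of tails. Note that the single most probable sequence (all heads) is not typical, while sequences with about pt·N = 100 tails are.

```python
from math import log2

p_head, N, beta = 0.9, 1000, 0.05
H = -p_head * log2(p_head) - (1 - p_head) * log2(1 - p_head)   # about 0.469 bits

def info_per_symbol(n_tails):
    """-(1/N) log2 P(x) for a length-N sequence with the given number of tails."""
    n_heads = N - n_tails
    return -(n_heads * log2(p_head) + n_tails * log2(1 - p_head)) / N

for n_tails in [0, 50, 100, 150, 1000]:
    v = info_per_symbol(n_tails)
    print(n_tails, round(v, 3), "typical" if abs(v - H) < beta else "not typical")
```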
1 Introduction
Quick Review
2 Extended Ensembles
Definition and Properties
Essential Bit Content
The Asymptotic Equipartition Property
3 The Source Coding Theorem
Typical Sets
Statement of the Theorem
17 / 27
The Source Coding Theorem

The Source Coding Theorem
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and
0 < δ < 1, there exists a positive integer N0 such that for all N > N0

    | (1/N) Hδ(X^N) − H | < ε.

[Figure: (1/N) Hδ(X^N) versus δ for N = 10, 210, 410, 610, 810, 1010, as on slide 13.]

Given a tiny probability of error δ, the average bits per outcome can be
made as close to H as required.
Even if we allow a large probability of error, we cannot compress to fewer
than H bits per outcome for large sequences.

18 / 27

Warning: proof ahead

I don’t expect you to reproduce the following proof
I present it as it sheds some light on why the result is true
And it is a remarkable and fundamental result
You are expected to understand and be able to apply the theorem

19 / 27

Proof of the SCT

The absolute value of a difference being bounded (e.g., |x − y| < ε) says
two things:
  1. When x − y is positive, it says x − y < ε, which means x < y + ε
  2. When x − y is negative, it says −(x − y) < ε, which means x > y − ε
So |x − y| < ε is equivalent to y − ε < x < y + ε

Using this, we break down the claim of the SCT into two parts: showing
that for any ε and δ we can find N large enough so that:
  Part 1: (1/N) Hδ(X^N) < H + ε
  Part 2: (1/N) Hδ(X^N) > H − ε
(Excerpt from MacKay, Information Theory, Inference, and Learning Algorithms, p. 82)

Chebyshev’s inequality 2. Let x be a random variable, and let α be a
positive real number. Then
    P((x − x̄)^2 ≥ α) ≤ σ_x^2 / α.                          (4.31)
Proof: Take t = (x − x̄)^2 and apply the previous proposition.

Weak law of large numbers. Take x to be the average of N independent
random variables h1, . . . , hN, having common mean h̄ and common
variance σ_h^2: x = (1/N) Σ_{n=1}^{N} hn. Then
    P((x − h̄)^2 ≥ α) ≤ σ_h^2 / (αN).                       (4.32)
Proof: obtained by showing that x̄ = h̄ and that σ_x^2 = σ_h^2 / N.

We are interested in x being very close to the mean (α very small). No matter
how large σ_h^2 is, and no matter how small the required α is, and no matter
how small the desired probability that (x − h̄)^2 ≥ α, we can always achieve it
by taking N large enough.

Proof of theorem 4.1 (p. 78)
We apply the law of large numbers to the random variable (1/N) log2(1/P(x))
defined for x drawn from the ensemble X^N. This random variable can be written
as the average of N information contents hn = log2(1/P(xn)), each of which is a
random variable with mean H = H(X) and variance σ^2 ≡ var[log2(1/P(xn))].
(Each term hn is the Shannon information content of the nth outcome.)

We again define the typical set with parameters N and β thus:
    TNβ = { x ∈ A_X^N : ((1/N) log2(1/P(x)) − H)^2 < β^2 }.   (4.33)
For all x ∈ TNβ, the probability of x satisfies
    2^{−N(H+β)} < P(x) < 2^{−N(H−β)}.                       (4.34)
And by the law of large numbers,
    P(x ∈ TNβ) ≥ 1 − σ^2 / (β^2 N).                         (4.35)
We have thus proved the ‘asymptotic equipartition’ principle. As N increases,
the probability that x falls in TNβ approaches 1, for any β. How does this
result relate to source coding?

We must relate TNβ to Hδ(X^N). We will show that for any given δ there
is a sufficiently big N such that Hδ(X^N) ≃ NH.

Part 1: (1/N) Hδ(X^N) < H + ε.
The set TNβ is not the best subset for compression. So the size of TNβ gives
an upper bound on Hδ. We show how small Hδ(X^N) must be by calculating
how big TNβ could possibly be. We are free to set β to any convenient value.
The smallest possible probability that a member of TNβ can have is 2^{−N(H+β)},
and the total probability contained by TNβ can’t be any bigger than 1. So
    |TNβ| 2^{−N(H+β)} < 1,                                  (4.36)
that is, the size of the typical set is bounded by
    |TNβ| < 2^{N(H+β)}.                                     (4.37)
If we set β = ε and N0 such that σ^2 / (ε^2 N0) ≤ δ, then P(TNβ) ≥ 1 − δ, and
the set TNβ becomes a witness to the fact that Hδ(X^N) ≤ log2 |TNβ| < N(H + ε).

[Figure 4.13: Schematic illustration of the two parts of the theorem. Given any
δ and ε, we show that for large enough N, (1/N) Hδ(X^N) lies (1) below the line
H + ε and (2) above the line H − ε, on axes (1/N) Hδ(X^N) from 0 to H0(X)
against δ from 0 to 1.]

20 / 27
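A numerical sanity check of the bound (4.35) for the bent coin (my own sketch, not part of MacKay's text; β, the trial count and the seed are arbitrary choices): estimate P(x ∈ TNβ) by Monte Carlo and compare it against the guaranteed lower bound 1 − σ²/(β²N).

```python
import random
from math import log2

random.seed(1)
p = 0.9                                   # P(Heads) for the bent coin
H = -p * log2(p) - (1 - p) * log2(1 - p)
# Variance of the per-symbol information content log2(1/P(x_n)).
sigma2 = p * (log2(1 / p) - H) ** 2 + (1 - p) * (log2(1 / (1 - p)) - H) ** 2

beta, trials = 0.1, 1000
for N in [100, 1000, 5000]:
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(N))
        info = -(heads * log2(p) + (N - heads) * log2(1 - p)) / N
        hits += abs(info - H) < beta      # did this sample land in the typical set?
    print(N, hits / trials, ">=", round(1 - sigma2 / (beta**2 * N), 3))
```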
Proof of the SCT
Idea

Proof Idea: As N increases
  TNβ has ∼ 2^{NH(X)} elements
  almost all x are in TNβ
  Sδ and TNβ increasingly overlap
  so log2 |Sδ| ∼ NH

Basically, we look to encode all typical sequences uniformly, and relate
that to the essential bit content

21 / 27
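To get a feel for the sizes involved (my own arithmetic, not on the slide): for the bent coin with H(X) ≈ 0.47, the typical set carries almost all the probability yet is an exponentially small fraction of all 2^N sequences.

```python
from math import log2

p = 0.9
H = -p * log2(p) - (1 - p) * log2(1 - p)   # about 0.469 bits
for N in [100, 1000]:
    # The typical set holds roughly 2^(N*H) sequences out of 2^N in total.
    print(N, f"about 2^{N*H:.0f} typical sequences out of 2^{N},",
          f"a fraction of about 2^-{N*(1 - H):.0f}")
```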
Proof of the SCT (Part 1)

For ε > 0 and δ > 0, want N large enough so (1/N) Hδ(X^N) < H(X) + ε.

Recall (see Lecture 10) for the typical set TNβ we have for any N, β that
    |TNβ| ≤ 2^{N(H(X)+β)}                                   (1)
and, by the AEP, for any β as N → ∞ we have P(x ∈ TNβ) → 1.

So for any δ > 0 we can always find an N such that P(x ∈ TNβ) ≥ 1 − δ.

Now recall the definition of the smallest δ-sufficient subset Sδ: it is the
smallest subset of outcomes such that P(x ∈ Sδ) ≥ 1 − δ, so |Sδ| ≤ |TNβ|.

So, given any δ and β we can find an N large enough so that, by (1),
    |Sδ| ≤ |TNβ| ≤ 2^{N(H(X)+β)}
and hence
    Hδ(X^N) = log2 |Sδ| ≤ log2 |TNβ| ≤ N(H(X) + β)

Setting β = ε and dividing through by N gives the result.

22 / 27
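How large does N need to be? Combining this with the law-of-large-numbers bound from MacKay's proof (set β = ε and require σ²/(ε²N0) ≤ δ), a sufficient block length is N0 ≥ σ²/(ε²δ). A quick calculation for the bent coin (my own illustration; the particular ε, δ pairs are arbitrary choices):

```python
from math import log2, ceil

p = 0.9
H = -p * log2(p) - (1 - p) * log2(1 - p)
# Variance of the per-outcome information content log2(1/P(x)).
sigma2 = p * (log2(1 / p) - H) ** 2 + (1 - p) * (log2(1 / (1 - p)) - H) ** 2

for eps, delta in [(0.1, 0.1), (0.05, 0.01), (0.01, 0.01)]:
    N0 = ceil(sigma2 / (eps**2 * delta))   # block length sufficient for the Part 1 bound
    print(f"eps={eps}, delta={delta}: N0 >= {N0}")
```

Even for fairly loose tolerances the sufficient block lengths run into the thousands, which foreshadows the practicality discussion at the end of the lecture.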
Proof of the SCT (Part 2)

For ε > 0 and δ > 0, want N large enough so (1/N) Hδ(X^N) > H(X) − ε.

Suppose this was not the case – that is, for every N we have
    (1/N) Hδ(X^N) ≤ H(X) − ε  ⟺  |Sδ| ≤ 2^{N(H(X)−ε)}

Let’s look at what this says about P(x ∈ Sδ) by writing (with T̄Nβ the
complement of TNβ)
    P(x ∈ Sδ) = P(x ∈ Sδ ∩ TNβ) + P(x ∈ Sδ ∩ T̄Nβ)
              ≤ |Sδ| 2^{−N(H−β)} + P(x ∈ T̄Nβ)
since every x ∈ TNβ has P(x) ≤ 2^{−N(H−β)} and Sδ ∩ T̄Nβ ⊂ T̄Nβ.

So
    P(x ∈ Sδ) ≤ 2^{N(H−ε)} 2^{−N(H−β)} + P(x ∈ T̄Nβ)
              = 2^{−N(ε−β)} + P(x ∈ T̄Nβ) → 0 as N → ∞
provided we choose β < ε (e.g., β = ε/2), since P(x ∈ TNβ) → 1.

But P(x ∈ Sδ) ≥ 1 − δ, by definition. Contradiction.

23 / 27
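Part 2 can also be seen numerically (my own sketch, not from the slides; ε = 0.1 and the block lengths are arbitrary choices): allow the bent coin only 2^{N(H−ε)} codewords, assign them to the most probable sequences, and watch the total probability they cover shrink as N grows, so the requirement P(x ∈ Sδ) ≥ 1 − δ must eventually fail.

```python
from math import comb, log2

p = 0.9
H = -p * log2(p) - (1 - p) * log2(1 - p)
eps = 0.1

def covered_probability(N):
    """Total probability of the 2^(N(H-eps)) most probable length-N sequences."""
    budget = 2 ** (N * (H - eps))          # number of codewords we allow ourselves
    covered, used = 0.0, 0
    for k in range(N + 1):                 # k tails: most probable sequences first
        p_seq = p ** (N - k) * (1 - p) ** k
        group = comb(N, k)
        take = min(group, int(budget - used))
        covered += take * p_seq
        used += take
        if used >= budget:
            break
    return covered

for N in [20, 100, 500]:
    print(N, round(covered_probability(N), 4))
# The covered mass shrinks towards 0 as N grows.
```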
Interpretation of the SCT

The Source Coding Theorem
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and
0 < δ < 1, there exists a positive integer N0 such that for all N > N0

    | (1/N) Hδ(X^N) − H | < ε.

If you want to uniformly code blocks of N symbols drawn i.i.d. from X:
  If you use more than NH(X) bits per block, you can do so with almost
  no loss of information as N → ∞
  If you use fewer than NH(X) bits per block, you will almost certainly
  lose information as N → ∞

24 / 27
Interpretation of the SCT

The Source Coding Theorem
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and
0 < δ < 1, there exists a positive integer N0 such that for all N > N0

    | (1/N) Hδ(X^N) − H | < ε.

Making the error probability δ ≈ 1 doesn’t really help
  We’re still “stuck with” coding the typical sequences
Assumes we deal with X^N
  If outcomes are dependent, the entropy H(X) need not be the limit
  We won’t look at such extensions

25 / 27

Implications of SCT

How practical is it to perform coding inspired by the SCT?
Not very!
  The theorem might require huge block sizes N0
  We’d need lookup tables of size |Sδ(X^N0)| ∼ 2^{N0 · H(X)}

Can we design more practical compression algorithms?
And will the entropy still feature as the fundamental limit?

26 / 27

Next time

We move towards more practical compression ideas
  Prefix and Uniquely Decodeable variable-length codes
  The Kraft Inequality

27 / 27