COMP2610 – Information Theory
Lecture 11: Entropy and Coding
Robert C. Williamson
Research School of Computer Science
27 August 2018
1 / 28
Brief Recap of Course (Last 6 Weeks)
How can we quantify information?
- Basic Definitions and Key Concepts
- Probability, Entropy & Information
How can we make good guesses?
- Probabilistic Inference
- Bayes Theorem
How much redundancy can we safely remove?
- Compression
- Source Coding Theorem, Kraft Inequality
- Block, Huffman, and Lempel-Ziv Coding
How much noise can we correct and how?
- Noisy-Channel Coding
- Repetition Codes, Hamming Codes
What is randomness?
- Kolmogorov Complexity
- Algorithmic Information Theory
2 / 28
Brief Overview of Course (Next 6 Weeks)
How can we quantify information?
- Basic Definitions and Key Concepts
- Probability, Entropy & Information
How can we make good guesses?
- Probabilistic Inference
- Bayes Theorem
How much redundancy can we safely remove?
- Compression
- Source Coding Theorem, Kraft Inequality
- Block, Huffman, and Lempel-Ziv Coding
How much noise can we correct and how?
- Noisy-Channel Coding
- Repetition Codes, Hamming Codes
What is randomness?
- Kolmogorov Complexity
- Algorithmic Information Theory
3 / 28
This time
Basic goal of compression
Key concepts: codes and their types, raw bit content, essential bit content
Informal statement of source coding theorem
4 / 28
1 Introduction
Overview
What is Compression?
A Communication Game
What’s the best we can do?
2 Formalising Coding
Entropy and Information: A Quick Review
Defining Codes
3 Formalising Compression
Reliability vs. Size
Key Result: The Source Coding Theorem
5 / 28
What is Compression?
Cn y rd ths mssg wtht ny vwls?
It is not too difficult to read as there is redundancy in English text.
(Estimates of 1-1.5 bits per character, compared to log2 26 ≈ 4.7)
[Figure 2.3 from MacKay (2003): conditional probability distributions for bigrams xy in English text. (a) P(y | x): each row shows the distribution of the second letter y given the first letter x. (b) P(x | y): each column shows the distribution of the first letter x given the second letter y.]
If you see a “q”, it is very likely
to be followed with a “u”
The letter “e” is much more
common than “j”
Compression exploits
differences in relative probability
of symbols or blocks of symbols
6 / 28
Compression in a Nutshell
Compression
Data compression is the process of replacing a message with a smaller
message which can be reliably converted back to the original.
7 / 28
A General Communication Game
Imagine the following game between Sender & Receiver:
Sender & Receiver agree on code for each outcome ahead of time
(e.g., 0 for Heads; 1 for Tails)
Sender observes outcomes then codes and sends message
Receiver decodes message and recovers outcome sequence
[Diagram: the Sender codes the outcome sequence "Heads, Tails, Heads, ..." into the message "010"; the Receiver decodes the message and recovers "Heads, Tails, Heads, ...".]
Goal: Want small messages on average when outcomes are from a fixed,
known, but uncertain source (e.g., coin flips with known bias)
8 / 28
Sneak peek: source coding theorem
Consider a coin with P(Heads) = 0.9. If we want perfect transmission:
Coding single outcomes requires 1 bit/outcome
Coding 10 outcomes at a time needs 10 bits, or 1 bit/outcome
Not very interesting!
Things get interesting if we:
accept errors in transmission (this week)
allow variable length messages (next week)
9 / 28
Sneak peek: source coding theorem
If we are happy to fail on up to 2% of the sequences we can ignore any
sequence of 10 outcomes with more than 3 tails
Why? The number of tails follows a Binomial(10, 0.1) distribution
There are only 176 < 2^8 sequences with 3 or fewer tails
So, we can just code those, and ignore the rest!
Coding 10 outcomes with 2% failure doable with 8 bits, or 0.8
bits/outcome
Smallest bits/outcome needed for 10,000 outcome sequences?
10 / 28
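As a sanity check on the counts quoted above, here is a minimal Python sketch (my own illustration, not part of the lecture; the variable names are made up) that counts the length-10 sequences with at most 3 tails and the probability of a sequence we refuse to code:

    from math import comb

    # Bent coin: P(Heads) = 0.9, P(Tails) = 0.1; we code blocks of N = 10 outcomes.
    p_tail, N = 0.1, 10

    # Number of length-10 sequences with at most 3 tails.
    n_sequences = sum(comb(N, k) for k in range(4))    # 1 + 10 + 45 + 120 = 176

    # Probability that a random sequence has more than 3 tails, i.e. that we fail to code it.
    p_fail = sum(comb(N, k) * p_tail**k * (1 - p_tail)**(N - k) for k in range(4, N + 1))

    print(n_sequences)          # 176 < 2**8 = 256, so 8 bits are enough for the kept sequences
    print(round(p_fail, 4))     # about 0.0128, within the 2% failure budget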
Generalisation: Source Coding Theorem
What happens when we generalise to arbitrary error probability, and
sequence size?
Source Coding Theorem (Informal Statement)
If: you want to uniformly code large sequences of outcomes with any
degree of reliability from a random source
Then: the average number of bits per outcome you will need is roughly
equal to the entropy of that source.
To define: “Uniformly code”, “large sequences”, “degree of reliability”,
“average number of bits per outcome”, “roughly equal”
11 / 28
1 Introduction
Overview
What is Compression?
A Communication Game
What’s the best we can do?
2 Formalising Coding
Entropy and Information: A Quick Review
Defining Codes
3 Formalising Compression
Reliability vs. Size
Key Result: The Source Coding Theorem
12 / 28
Entropy and Information: Recap
Ensemble
An ensemble X is a triple (x ,AX ,PX ); x is a random variable taking
values in AX = {a1, a2, . . . , aI} with probabilities PX = {p1, p2, . . . , pI}.
Information
The information in the observation that x = ai (in the ensemble X ) is
h(ai) = log2(1/pi) = − log2 pi
Entropy
The entropy of an ensemble X is the average information
H(X) = E[h(x)] = Σi pi h(ai) = Σi pi log2(1/pi)
13 / 28
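The two definitions above translate directly into code. Below is a minimal Python sketch (an illustration, not lecture material; the helper names `information` and `entropy` are my own, and the example probabilities are those of the three-colour ensemble used later in this lecture):

    from math import log2

    def information(p):
        """Information content h(a_i) = log2(1/p_i), in bits, of an outcome with probability p."""
        return log2(1 / p)

    def entropy(probs):
        """Entropy H(X) = sum_i p_i log2(1/p_i), in bits; outcomes with p_i = 0 contribute nothing."""
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(information(0.25))            # 2.0 bits
    print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits, for the three-colour ensemble used later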
What is a Code?
A source code is a process for assigning names to outcomes. The names
are typically expressed by strings of binary symbols.
We will denote the set of all finite binary strings by
{0, 1}+ def= {0, 1, 00, 01, 10, . . .}
Source Code
Given an ensemble X , the function c : AX → {0, 1}+ is a source code for
X . The number of symbols in c(x) is the length l(x) of the codeword for x .
The extension of c is defined by c(x1 . . . xn) = c(x1) . . . c(xn)
Example:
The code c names outcomes from AX = {r, g, b} by
c(r) = 00, c(g) = 10, c(b) = 11
The length of the codeword for each outcome is 2.
The extension of c gives c(rgrb) = 00100011
14 / 28
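As an illustration (not from the slides), the example code c and its extension can be written as a small Python sketch; the dictionary representation and the helper names `length` and `extension` are assumptions made here for clarity:

    # The example code c, represented as a plain dictionary.
    c = {'r': '00', 'g': '10', 'b': '11'}

    def length(code, x):
        """Codeword length l(x): the number of binary symbols in c(x)."""
        return len(code[x])

    def extension(code, outcomes):
        """The extension of c: concatenate the codewords of x1 ... xn."""
        return ''.join(code[x] for x in outcomes)

    print(extension(c, 'rgrb'))   # '00100011', as in the example above
    print(length(c, 'g'))         # 2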
Types of Codes
Let X be an ensemble and c : AX → {0, 1}+ a code for X . We say c is a:
Uniform Code if l(x) is the same for all x ∈ AX
Variable-Length Code otherwise
Another important criterion for codes is whether the original symbol x can
be unambiguously determined given c(x). We say c is a:
Lossless Code if for all x1, x2 ∈ AX we have x1 ≠ x2 implies
c(x1) ≠ c(x2)
Lossy Code otherwise
15 / 28
Types of Codes
Examples
Examples: Let AX = {a, b, c, d}
1 c(a) = 00, c(b) = 01, c(c) = 10, c(d) = 11 is uniform lossless
2 c(a) = 0, c(b) = 10, c(c) = 110, c(d) = 111 is variable-length lossless
3 c(a) = 0, c(b) = 0, c(c) = 110, c(d) = 111 is variable-length lossy
4 c(a) = 00, c(b) = 00, c(c) = 10, c(d) = 11 is uniform lossy
5 c(a) = −, c(b) = −, c(c) = 10, c(d) = 11 is uniform lossy
16 / 28
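A short sketch of how one might check these properties mechanically (the helper names `is_uniform` and `is_lossless` are invented here; code 5 is omitted because "−" stands for a missing codeword, discussed on the next slide):

    def is_uniform(code):
        """Uniform iff every codeword has the same length."""
        return len({len(w) for w in code.values()}) == 1

    def is_lossless(code):
        """Lossless iff distinct symbols get distinct codewords."""
        return len(set(code.values())) == len(code)

    # Codes 1-4 from the examples above.
    examples = [
        {'a': '00', 'b': '01', 'c': '10',  'd': '11'},
        {'a': '0',  'b': '10', 'c': '110', 'd': '111'},
        {'a': '0',  'b': '0',  'c': '110', 'd': '111'},
        {'a': '00', 'b': '00', 'c': '10',  'd': '11'},
    ]

    for code in examples:
        print(is_uniform(code), is_lossless(code))
    # True True, False True, False False, True False — matching the classifications above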
A Note on Lossy Codes & Missing Codewords
When talking about a uniform lossy code c for AX = {a, b, c} we write
c(a) = 0 c(b) = 1 c(c) = -
where the symbol - means “no codeword”. This is shorthand for “the
receiver will decode this codeword incorrectly”
For the purposes of these lectures, this is equivalent to the code
c(a) = 0 c(b) = 1 c(c) = 1
and the sender and receiver agreeing that the codeword 1 should always
be decoded as b
Assigning the outcome ai the missing codeword “-” just means “it is not
possible to send ai reliably”
17 / 28
1 Introduction
Overview
What is Compression?
A Communication Game
What’s the best we can do?
2 Formalising Coding
Entropy and Information: A Quick Review
Defining Codes
3 Formalising Compression
Reliability vs. Size
Key Result: The Source Coding Theorem
18 / 28
Lossless Coding
Example: Colours
Three colour ensemble with AX = {r, g, b}
with r twice as likely as b or g
pr = 0.5 and pg = pb = 0.25.
Suppose we use the following uniform lossless code
c(r) = 00; c(g) = 10; and c(b) = 11
For example c(rrgbrbr) = 00001011001100 will have 14 bits.
On average, we will use l(r)pr + l(g)pg + l(b)pb = 2 bits per outcome
2N bits to code a sequence of N outcomes
19 / 28
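A minimal check of the arithmetic above (illustrative only; the dictionary names are my own):

    # Expected codeword length for the colours ensemble under the uniform lossless code above.
    p = {'r': 0.5, 'g': 0.25, 'b': 0.25}
    c = {'r': '00', 'g': '10', 'b': '11'}

    expected_length = sum(p[x] * len(c[x]) for x in p)
    print(expected_length)                          # 2.0 bits per outcome, so 2N bits for N outcomes
    print(len(''.join(c[x] for x in 'rrgbrbr')))    # 14 bits for the example sequence rrgbrbr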
Raw Bit Content
Uniform coding gives a crude measure of information: the number of bits
required to assign equal length codes to each symbol
Raw Bit Content
If X is an ensemble with outcome set AX then its raw bit content is
H0(X ) = log2 |AX |.
x c(x)
a 000
b 001
c 010
d 011
e 100
f 101
g 110
h 111
Example:
This is a uniform encoding of outcomes in
AX = {a, b, c, d, e, f, g, h}:
Each outcome is encoded using H0(X ) = 3 bits
The probabilities of the outcomes are ignored
Same as assuming a uniform distribution
For the purposes of compression, the exact codes
don’t matter – just the number of bits used.
20 / 28
Lossy Coding
Example: Colours
Three colour ensemble with AX = {r, g, b}
pr = 0.5 and pg = pb = 0.25.
Using uniform lossy code:
c(r) = 0; c(g) = −; and c(b) = 1
Examples:
c(rrrrrrr) = 0000000; c(rrbbrbr) = 0011010; c(rrgbrbr) = −
What is probability we can reliably code a sequence of N outcomes?
Given we can code a sequence of length N, how many bits are expected?
21 / 28
Lossy Coding
Example: Colours
What is probability we can reliably code a sequence of N outcomes?
P(x1 . . . xN has no g) = P(x1 ≠ g) · · · P(xN ≠ g) = (1 − pg)^N
Given we can code a sequence of length N, how many bits are expected?
E[l(X1) + · · · + l(XN) | X1 ≠ g, . . . , XN ≠ g] = Σn E[l(Xn) | Xn ≠ g]
= N (l(r) pr + l(b) pb) / (1 − pg) = N = N log2 |{r, b}|
since l(r) = l(b) = 1 and pr + pb = 1 − pg.
c.f. 2N bits with the lossless code
22 / 28
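These two facts can be checked empirically. The following Monte Carlo sketch is my own illustration under the stated probabilities, not lecture material:

    import random

    random.seed(0)
    p = {'r': 0.5, 'g': 0.25, 'b': 0.25}
    c = {'r': '0', 'b': '1'}            # uniform lossy code: 'g' has no codeword
    N, trials = 7, 100_000

    codable, total_bits = 0, 0
    for _ in range(trials):
        seq = random.choices(list(p), weights=list(p.values()), k=N)
        if 'g' not in seq:              # the sequence can be coded reliably
            codable += 1
            total_bits += sum(len(c[x]) for x in seq)

    print(codable / trials, (1 - p['g']) ** N)   # both roughly 0.1335 = 0.75**7
    print(total_bits / codable)                  # 7.0 bits for N = 7, i.e. 1 bit per outcome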
Essential Bit Content
There is an inherent trade off between the number of bits required in a
uniform lossy code and the probability of being able to code an outcome
Smallest δ-sufficient subset
Let X be an ensemble and for 0 ≤ δ ≤ 1, define Sδ to be the smallest
subset of AX such that
P(x ∈ Sδ) ≥ 1− δ
For small δ, smallest collection of most likely outcomes
If we uniformly code elements in Sδ, and ignore all others:
We can code a sequence of length N with probability (1 − δ)^N
If we can code a sequence, its expected length is N log2 |Sδ|
23 / 28
Essential Bit Content
Example
Intuitively, construct Sδ by removing elements of AX in ascending order of
probability, until the remaining set just satisfies P(x ∈ Sδ) ≥ 1 − δ
x P(x)
a 1/4
b 1/4
c 1/4
d 3/16
e 1/64
f 1/64
g 1/64
h 1/64
Outcomes are ranked (high to low) by P(x = ai); the least probable are
removed to form the set Sδ with P(x ∈ Sδ) ≥ 1 − δ
δ = 0 : Sδ = {a, b, c, d, e, f, g, h}
δ = 1/64 : Sδ = {a, b, c, d, e, f, g}
δ = 1/16 : Sδ = {a, b, c, d}
δ = 3/4 : Sδ = {a}
24 / 28
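A small sketch (illustrative; the function name is made up) that reconstructs the sets listed above by greedily keeping the most probable outcomes. Fractions are used so that the ≥ 1 − δ comparisons are exact; the last printed value anticipates the essential bit content log2 |Sδ| defined on the next slide:

    from math import log2
    from fractions import Fraction

    # The ensemble from the table above.
    P = {'a': Fraction(1, 4), 'b': Fraction(1, 4), 'c': Fraction(1, 4), 'd': Fraction(3, 16),
         'e': Fraction(1, 64), 'f': Fraction(1, 64), 'g': Fraction(1, 64), 'h': Fraction(1, 64)}

    def smallest_delta_sufficient(P, delta):
        """Greedily keep the most probable outcomes until P(x in S_delta) >= 1 - delta."""
        S, mass = [], Fraction(0)
        for x, p in sorted(P.items(), key=lambda kv: kv[1], reverse=True):
            if mass >= 1 - delta:
                break
            S.append(x)
            mass += p
        return S

    for delta in [Fraction(0), Fraction(1, 64), Fraction(1, 16), Fraction(3, 4)]:
        S = smallest_delta_sufficient(P, delta)
        print(delta, S, round(log2(len(S)), 3))   # last value is log2|S_delta|

This reproduces the four sets listed above: all 8 outcomes for δ = 0, then 7, 4, and finally just {a}.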
Essential Bit Content
The trade-off between the probability δ of not coding an outcome and the
size of a uniform code is captured by the essential bit content
Essential Bit Content
For an ensemble X and 0 ≤ δ ≤ 1, the essential bit content of X is
Hδ(X) def= log2 |Sδ|
x P(x)
a 1/4
b 1/4
c 1/4
d 3/16
e 1/64
f 1/64
g 1/64
h 1/64
[Figure (MacKay, Fig. 4.1): (a) the outcomes of X ranked by their information content − log2 P(x); (b) the essential bit content Hδ(X) as a step function of δ, starting at log2 8 = 3 bits for δ = 0 and stepping down through {a,b,c,d,e,f,g}, {a,b,c,d,e,f}, . . . , {a,b}, {a} as δ increases towards 1.]
25 / 28
The Source Coding Theorem for Uniform Codes
(Theorem 4.1 in MacKay)
Our aim next time is to understand this:
The Source Coding Theorem for Uniform Codes
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and
0 < δ < 1, there exists a positive integer N0 such that for all N > N0
|(1/N) Hδ(X^N) − H| < ε.
What?
The term (1/N) Hδ(X^N) is the average number of bits required to
uniformly code all but a proportion δ of the symbols.
Given a tiny probability of error δ, the average bits per symbol can be
made as close to H as required.
Even if we allow a large probability of error, we cannot compress to
fewer than H bits per symbol.
26 / 28
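To make the statement concrete, here is a sketch (my own illustration, assuming the bent coin with P(Heads) = 0.9 from the sneak peek; the function name is invented) that computes (1/N) Hδ(X^N) exactly by adding sequences from most to least probable until their total probability reaches 1 − δ:

    from math import comb, log2, ceil

    p_tail = 0.1                                    # bent coin: P(Heads) = 0.9, P(Tails) = 0.1
    H = -(0.9 * log2(0.9) + 0.1 * log2(0.1))        # entropy per outcome, about 0.469 bits

    def bits_per_outcome(N, delta):
        """(1/N) * H_delta(X^N): add sequences in order of decreasing probability
        (i.e. increasing number of tails k) until their total probability is >= 1 - delta."""
        size, mass = 0, 0.0
        for k in range(N + 1):
            p_seq = (1 - p_tail) ** (N - k) * p_tail ** k   # probability of one sequence with k tails
            count = comb(N, k)                              # number of such sequences
            if mass + count * p_seq >= 1 - delta:
                needed = ceil((1 - delta - mass) / p_seq)   # only part of this shell is required
                return log2(size + needed) / N
            size += count
            mass += count * p_seq
        return log2(size) / N

    for N in [10, 100, 1000]:
        print(N, round(bits_per_outcome(N, delta=0.02), 3))  # decreases towards H as N grows
    print(round(H, 3))                                       # about 0.469

The convergence is slow at these N, but the per-outcome cost is already well below the 1 bit/outcome of naive uniform coding and, by the theorem, approaches H as N grows.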
Some Intuition for the SCT
Don't code individual symbols in an ensemble; rather, consider sequences of length N.
As the length of the sequence increases, the probability of seeing a “typical” sequence becomes much larger than that of an “atypical” sequence.
Thus, we can get by with essentially assigning a unique codeword to each typical sequence.
27 / 28
Next time
Recap: typical sets
Formal statement of the source coding theorem
Proof of the source coding theorem
28 / 28