
COMP2610 – Information Theory
Lecture 11: Entropy and Coding

Robert C. Williamson

Research School of Computer Science


27 August 2018

1 / 28

Brief Recap of Course (Last 6 Weeks)

How can we quantify information?
- Basic Definitions and Key Concepts
- Probability, Entropy & Information

How can we make good guesses?
- Probabilistic Inference
- Bayes Theorem

How much redundancy can we safely remove?
- Compression
- Source Coding Theorem, Kraft Inequality
- Block, Huffman, and Lempel-Ziv Coding

How much noise can we correct and how?
- Noisy-Channel Coding
- Repetition Codes, Hamming Codes

What is randomness?
- Kolmogorov Complexity
- Algorithmic Information Theory

2 / 28

Brief Overview of Course (Next 6 Weeks)

How can we quantify information?
- Basic Definitions and Key Concepts
- Probability, Entropy & Information

How can we make good guesses?
- Probabilistic Inference
- Bayes Theorem

How much redundancy can we safely remove?
- Compression
- Source Coding Theorem, Kraft Inequality
- Block, Huffman, and Lempel-Ziv Coding

How much noise can we correct and how?
- Noisy-Channel Coding
- Repetition Codes, Hamming Codes

What is randomness?
- Kolmogorov Complexity
- Algorithmic Information Theory

3 / 28

This time

Basic goal of compression

Key concepts: codes and their types, raw bit content, essential bit content

Informal statement of source coding theorem

4 / 28

1 Introduction
Overview
What is Compression?
A Communication Game
What’s the best we can do?

2 Formalising Coding
Entropy and Information: A Quick Review
Defining Codes

3 Formalising Compression
Reliability vs. Size
Key Result: The Source Coding Theorem

5 / 28

What is Compression?

Cn y rd ths mssg wtht ny vwls?

It is not too difficult to read because there is redundancy in English text.
(Estimates put English at 1–1.5 bits per character, compared to log2 26 ≈ 4.7 bits if all 26 letters were equally likely.)
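A throwaway sketch (mine, not from the slides) that strips the vowels from the plain sentence "Can you read this message without any vowels?" and reports how much shorter it gets:

msg = "Can you read this message without any vowels?"
stripped = "".join(ch for ch in msg if ch.lower() not in "aeiou")
print(stripped)                   # Cn y rd ths mssg wtht ny vwls?
print(len(stripped) / len(msg))   # ~0.67: the message survives losing a third of its characters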

[Figure 2.3 from MacKay, Information Theory, Inference, and Learning Algorithms (2003): conditional probability distributions over bigrams xy of English text. (a) P(y | x): each row shows the conditional distribution of the second letter y given the first letter x. (b) P(x | y): each column shows the conditional distribution of the first letter x given the second letter y.]
If you see a “q”, it is very likely to be followed by a “u”.
The letter “e” is much more common than “j”.
Compression exploits differences in the relative probability of symbols or blocks of symbols.

6 / 28

Compression in a Nutshell

Compression
Data compression is the process of replacing a message with a smaller
message which can be reliably converted back to the original.

7 / 28

A General Communication Game

Imagine the following game between Sender & Receiver:

Sender & Receiver agree on code for each outcome ahead of time
(e.g., 0 for Heads; 1 for Tails)
Sender observes outcomes then codes and sends message
Receiver decodes message and recovers outcome sequence

[Diagram: the Sender codes the outcome sequence “Heads, Tails, Heads, …” into the message 010, which the Receiver decodes back into “Heads, Tails, Heads, …”.]

Goal: Want small messages on average when outcomes are from a fixed,
known, but uncertain source (e.g., coin flips with known bias)

8 / 28
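To make the game concrete, here is a minimal Python sketch (not part of the original slides) of the agreed code, 0 for Heads and 1 for Tails: the sender encodes a run of outcomes into a bit string and the receiver decodes it back.

# Minimal sketch of the communication game (illustrative only).
CODE = {"Heads": "0", "Tails": "1"}                    # agreed ahead of time
DECODE = {bits: outcome for outcome, bits in CODE.items()}

def encode(outcomes):
    """Sender: replace each outcome by its codeword and concatenate."""
    return "".join(CODE[o] for o in outcomes)

def decode(message):
    """Receiver: every codeword is a single bit here, so decode bit by bit."""
    return [DECODE[bit] for bit in message]

outcomes = ["Heads", "Tails", "Heads"]
message = encode(outcomes)                             # "010"
assert decode(message) == outcomes                     # the outcome sequence is recovered
print(message)

With this uniform code the message is always exactly one bit per outcome; the rest of the lecture asks how much shorter the message can be made on average when the source is biased.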

Sneak peek: source coding theorem

Consider a coin with P(Heads) = 0.9. If we want perfect transmission:

Coding single outcomes requires 1 bit/outcome
Coding 10 outcomes at a time needs 10 bits, or 1 bit/outcome

Not very interesting!

Things get interesting if we:

accept errors in transmission (this week)
allow variable length messages (next week)

9 / 28

Sneak peek: source coding theorem

If we are happy to fail on up to 2% of the sequences, we can ignore any sequence of 10 outcomes with more than 3 tails.

Why? The number of tails follows a Binomial(10, 0.1) distribution

There are only 176 < 2^8 sequences with 3 or fewer tails

So, we can just code those, and ignore the rest!

Coding 10 outcomes with 2% failure doable with 8 bits, or 0.8 bits/outcome

Smallest bits/outcome needed for 10,000 outcome sequences?

10 / 28

Generalisation: Source Coding Theorem

What happens when we generalise to arbitrary error probability and sequence size?

Source Coding Theorem (Informal Statement)
If: you want to uniformly code large sequences of outcomes with any degree of reliability from a random source
Then: the average number of bits per outcome you will need is roughly equal to the entropy of that source.

To define: “uniformly code”, “large sequences”, “degree of reliability”, “average number of bits per outcome”, “roughly equal”

11 / 28

1 Introduction
Overview
What is Compression?
A Communication Game
What’s the best we can do?

2 Formalising Coding
Entropy and Information: A Quick Review
Defining Codes

3 Formalising Compression
Reliability vs. Size
Key Result: The Source Coding Theorem

12 / 28

Entropy and Information: Recap

Ensemble
An ensemble X is a triple (x, AX, PX); x is a random variable taking values in AX = {a1, a2, . . . , aI} with probabilities PX = {p1, p2, . . . , pI}.

Information
The information in the observation that x = ai (in the ensemble X) is
h(ai) = log2(1/pi) = − log2 pi

Entropy
The entropy of an ensemble X is the average information
H(X) = E[h(x)] = ∑i pi h(ai) = ∑i pi log2(1/pi)

13 / 28
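A quick numerical check (my own Python sketch, not from the slides) of the sneak-peek numbers and of the entropy definition just recapped: the count of length-10 sequences with at most 3 tails, the probability that the Binomial(10, 0.1) source produces a sequence we refuse to code, and the entropy of the P(Heads) = 0.9 coin.

from math import comb, log2

N, p_tail = 10, 0.1

# Number of length-10 outcome sequences with at most 3 tails.
n_seq = sum(comb(N, k) for k in range(4))
print(n_seq, n_seq < 2**8)        # 176 True  -> 8 bits are enough to index them

# Probability of a sequence we chose not to code (more than 3 tails).
p_fail = 1 - sum(comb(N, k) * p_tail**k * (1 - p_tail)**(N - k) for k in range(4))
print(round(p_fail, 4))           # ~0.0128  -> within the 2% failure budget

# Entropy of the source: H(X) = sum_i p_i log2(1/p_i).
H = sum(p * log2(1 / p) for p in (0.9, 0.1))
print(round(H, 3))                # ~0.469 bits per outcome

The informal statement above says that, for long sequences and any fixed reliability, roughly this 0.469 bits per outcome is both achievable and necessary.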
What is a Code?

A source code is a process for assigning names to outcomes. The names are typically expressed by strings of binary symbols. We will denote the set of all finite binary strings by
{0, 1}+ := {0, 1, 00, 01, 10, . . .}

Source Code
Given an ensemble X, the function c : AX → {0, 1}+ is a source code for X. The number of symbols in c(x) is the length l(x) of the codeword for x. The extension of c is defined by c(x1 . . . xn) = c(x1) . . . c(xn).

Example: The code c names outcomes from AX = {r, g, b} by
c(r) = 00, c(g) = 10, c(b) = 11
The length of the codeword for each outcome is 2. The extension of c gives c(rgrb) = 00100011.

14 / 28

Types of Codes

Let X be an ensemble and c : AX → {0, 1}+ a code for X. We say c is a:

Uniform Code if l(x) is the same for all x ∈ AX
Variable-Length Code otherwise

Another important criterion for codes is whether the original symbol x can be unambiguously determined given c(x). We say c is a:

Lossless Code if for all x1, x2 ∈ AX we have x1 ≠ x2 implies c(x1) ≠ c(x2)
Lossy Code otherwise

15 / 28

Types of Codes: Examples

Examples: Let AX = {a, b, c, d}

1 c(a) = 00, c(b) = 01, c(c) = 10, c(d) = 11 is uniform lossless
2 c(a) = 0, c(b) = 10, c(c) = 110, c(d) = 111 is variable-length lossless
3 c(a) = 0, c(b) = 0, c(c) = 110, c(d) = 111 is variable-length lossy
4 c(a) = 00, c(b) = 00, c(c) = 10, c(d) = 11 is uniform lossy
5 c(a) = −, c(b) = −, c(c) = 10, c(d) = 11 is uniform lossy

16 / 28

A Note on Lossy Codes & Missing Codewords

When talking about a uniform lossy code c for AX = {a, b, c} we write
c(a) = 0, c(b) = 1, c(c) = −
where the symbol − means “no codeword”. This is shorthand for “the receiver will decode this codeword incorrectly”.

For the purposes of these lectures, this is equivalent to the code
c(a) = 0, c(b) = 1, c(c) = 1
and the sender and receiver agreeing that the codeword 1 should always be decoded as b.

Assigning the outcome ai the missing codeword “−” just means “it is not possible to send ai reliably”.

17 / 28

1 Introduction
Overview
What is Compression?
A Communication Game
What’s the best we can do?

2 Formalising Coding
Entropy and Information: A Quick Review
Defining Codes

3 Formalising Compression
Reliability vs. Size
Key Result: The Source Coding Theorem

18 / 28
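Before moving on to compression proper, here is a small sketch (my own, with illustrative names) of the code definitions from the last few slides: a code as a mapping into binary strings, its extension, and direct checks of the uniform and lossless properties.

def extension(c, symbols):
    """Extension of a code c: c(x1 ... xn) = c(x1) ... c(xn)."""
    return "".join(c[x] for x in symbols)

def is_uniform(c):
    """Uniform code: every codeword has the same length l(x)."""
    return len({len(w) for w in c.values()}) == 1

def is_lossless(c):
    """Lossless code: distinct symbols always get distinct codewords."""
    return len(set(c.values())) == len(c)

c = {"r": "00", "g": "10", "b": "11"}                   # the colour code above
print(extension(c, "rgrb"))                             # 00100011
print(is_uniform(c), is_lossless(c))                    # True True

c3 = {"a": "0", "b": "0", "c": "110", "d": "111"}       # example 3: variable-length lossy
print(is_uniform(c3), is_lossless(c3))                  # False False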
Lossless Coding Example: Colours

Three-colour ensemble with AX = {r, g, b}, with r twice as likely as b or g:
pr = 0.5 and pg = pb = 0.25.

Suppose we use the following uniform lossless code:
c(r) = 00; c(g) = 10; and c(b) = 11

For example, c(rrgbrbr) = 00001011001100 will have 14 bits. On average, we will use

l(r)pr + l(g)pg + l(b)pb = 2 bits per outcome

2N bits to code a sequence of N outcomes

19 / 28

Raw Bit Content

Uniform coding gives a crude measure of information: the number of bits required to assign equal-length codes to each symbol.

Raw Bit Content
If X is an ensemble with outcome set AX then its raw bit content is
H0(X) = log2 |AX|.

Example: This is a uniform encoding of outcomes in AX = {a, b, c, d, e, f, g, h}:

x   c(x)
a   000
b   001
c   010
d   011
e   100
f   101
g   110
h   111

Each outcome is encoded using H0(X) = 3 bits
The probabilities of the outcomes are ignored
Same as assuming a uniform distribution
For the purposes of compression, the exact codes don’t matter – just the number of bits used.

20 / 28

Lossy Coding Example: Colours

Three-colour ensemble with AX = {r, g, b}, pr = 0.5 and pg = pb = 0.25.

Using the uniform lossy code: c(r) = 0; c(g) = −; and c(b) = 1

Examples: c(rrrrrrr) = 0000000; c(rrbbrbr) = 0011010; c(rrgbrbr) = −

What is the probability we can reliably code a sequence of N outcomes?

Given we can code a sequence of length N, how many bits are expected?

21 / 28

Lossy Coding Example: Colours

What is the probability we can reliably code a sequence of N outcomes?

P(x1 . . . xN has no g) = P(x1 ≠ g) · · · P(xN ≠ g) = (1 − pg)^N

Given we can code a sequence of length N, how many bits are expected?

E[l(X1) + · · · + l(XN) | X1 ≠ g, . . . , XN ≠ g] = ∑_{n=1}^{N} E[l(Xn) | Xn ≠ g]
  = N (l(r)pr + l(b)pb) / (1 − pg)
  = N
  = N log2 |{r, b}|

since l(r) = l(b) = 1 and pr + pb = 1 − pg.

c.f. 2N bits with lossless code

22 / 28
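A quick sanity check of the two answers above, via a small simulation I added (not from the slides): estimate the probability that a length-N sequence contains no g, and the number of bits used given that it could be coded.

import random

random.seed(0)
p = {"r": 0.5, "g": 0.25, "b": 0.25}
codeword_len = {"r": 1, "b": 1}            # the lossy code above: g has no codeword
N, trials = 7, 200_000

codable, total_bits = 0, 0
for _ in range(trials):
    seq = random.choices(list(p), weights=list(p.values()), k=N)
    if "g" not in seq:                     # only g-free sequences can be coded
        codable += 1
        total_bits += sum(codeword_len[x] for x in seq)

print(codable / trials, (1 - p["g"]) ** N)   # both close to 0.75^7 ~ 0.133
print(total_bits / codable, N)               # exactly N = 7 bits per codable sequence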
Essential Bit Content

There is an inherent trade-off between the number of bits required in a uniform lossy code and the probability of being able to code an outcome.

Smallest δ-sufficient subset
Let X be an ensemble and, for 0 ≤ δ ≤ 1, define Sδ to be the smallest subset of AX such that
P(x ∈ Sδ) ≥ 1 − δ

For small δ, smallest collection of most likely outcomes

If we uniformly code elements in Sδ, and ignore all others:
We can code a sequence of length N with probability (1 − δ)^N
If we can code a sequence, its expected length is N log2 |Sδ|

23 / 28

Essential Bit Content Example

Intuitively, construct Sδ by removing elements of X in ascending order of probability, till we have reached the 1 − δ threshold.

x   P(x)
a   1/4
b   1/4
c   1/4
d   3/16
e   1/64
f   1/64
g   1/64
h   1/64

Outcomes ranked (high–low) by P(x = ai) are removed to make a set Sδ with P(x ∈ Sδ) ≥ 1 − δ:

δ = 0 :    Sδ = {a, b, c, d, e, f, g, h}
δ = 1/64 : Sδ = {a, b, c, d, e, f, g}
δ = 1/16 : Sδ = {a, b, c, d}
δ = 3/4 :  Sδ = {a}

24 / 28

Essential Bit Content

The trade-off between a probability δ of not coding an outcome and the size of the uniform code is captured by the essential bit content.

Essential Bit Content
For an ensemble X and 0 ≤ δ ≤ 1, the essential bit content of X is
Hδ(X) := log2 |Sδ|

(Same ensemble as on the previous slide.)

[Figure: (a) the outcomes of X ranked by −log2 P(x), with the subsets S0 and S_{1/16} marked; (b) the essential bit content Hδ(X) plotted against δ, a staircase stepping down from 3 bits (Sδ = {a, b, c, d, e, f, g, h}) at δ = 0 to 0 bits (Sδ = {a}) as δ grows.]

25 / 28
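The construction of Sδ is easy to mechanise. This sketch (mine, not MacKay’s) greedily keeps the most probable outcomes of the example ensemble until their total probability reaches 1 − δ, then reports Hδ(X) = log2 |Sδ|:

from fractions import Fraction as F
from math import log2

# The example ensemble above.
P = {"a": F(1, 4), "b": F(1, 4), "c": F(1, 4), "d": F(3, 16),
     "e": F(1, 64), "f": F(1, 64), "g": F(1, 64), "h": F(1, 64)}

def smallest_sufficient_subset(P, delta):
    """Keep outcomes in decreasing order of probability until P(x in S) >= 1 - delta."""
    S, covered = [], F(0)
    for x, px in sorted(P.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= 1 - delta:
            break
        S.append(x)
        covered += px
    return S

for delta in (F(0), F(1, 64), F(1, 16), F(3, 4)):
    S = smallest_sufficient_subset(P, delta)
    print(f"delta = {delta}: S = {S}, H_delta(X) = {log2(len(S)):.2f}")

Running this reproduces the sets listed above and the staircase of values 3, 2.81, 2, 0 bits.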
The Source Coding Theorem for Uniform Codes (Theorem 4.1 in MacKay)

Our aim next time is to understand this:

The Source Coding Theorem for Uniform Codes
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for all N > N0

| (1/N) Hδ(X^N) − H | < ε.

What?
The term (1/N) Hδ(X^N) is the average number of bits required to uniformly code all but a proportion δ of the symbols.
Given a tiny probability of error δ, the average bits per symbol can be made as close to H as required.
Even if we allow a large probability of error, we cannot compress to fewer than H bits per symbol.

26 / 28
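To see the theorem in action numerically, here is a sketch I added (not MacKay’s) that computes Hδ(X^N)/N exactly for the P(Heads) = 0.9 coin with δ = 0.02. It builds the smallest δ-sufficient subset of length-N sequences shell by shell (k = number of tails, least-probable shells added last); the bits-per-outcome figure drifts down towards H ≈ 0.469 as N grows.

from math import ceil, comb, log2

def bits_per_outcome(N, p_tail=0.1, delta=0.02):
    """H_delta(X^N) / N for N tosses of a coin with P(Tails) = p_tail."""
    p_head = 1 - p_tail
    covered, size = 0.0, 0
    for k in range(N + 1):                       # k = number of tails in the sequence
        p_seq = p_head ** (N - k) * p_tail ** k  # probability of one such sequence
        shell = comb(N, k)                       # how many such sequences there are
        if covered + shell * p_seq >= 1 - delta:
            # only part of this shell is needed to reach probability 1 - delta
            size += ceil((1 - delta - covered) / p_seq)
            break
        covered += shell * p_seq
        size += shell
    return log2(size) / N                        # H_delta(X^N) = log2 |S_delta|

H = sum(p * log2(1 / p) for p in (0.9, 0.1))     # ~0.469 bits
for N in (10, 100, 1000):
    print(N, round(bits_per_outcome(N), 3), round(H, 3))

For N = 10 this gives 0.733 bits per outcome: the smallest δ-sufficient subset has 161 sequences, slightly fewer than the 176 sequences of the whole k ≤ 3 shell used in the sneak peek, though both fit in 8 bits with a uniform code.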
Some Intuition for the SCT

Don’t code individual symbols in an ensemble; rather, consider sequences of length N.

As the length of the sequence increases, the probability of seeing a “typical” sequence becomes much larger than that of the “atypical” sequences.

Thus, we can get by with essentially assigning a unique codeword to each typical sequence.

27 / 28

Next time

Recap: typical sets

Formal statement of source coding theorem

Proof of source coding theorem

28 / 28