COMP2610 / 6261 – Information Theory
Lecture 14: Source Coding Theorem for Symbol Codes
Robert C. Williamson
Research School of Computer Science
The Australian National University
17 September, 2018
1 / 34
Last time
Variable-length codes
Uniquely decodable and prefix codes
Prefix codes as trees
Kraft’s inequality:
Lengths {ℓ_i}_{i=1}^I can form a prefix code ⇐⇒ ∑_{i=1}^I 2^{−ℓ_i} ≤ 1
How to generate prefix codes?
2 / 34
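As a quick sanity check, Kraft's inequality is easy to test numerically. A minimal Python sketch (the function name is mine, purely for illustration):

def kraft_sum(lengths):
    # sum_i 2^(-l_i); a prefix code with these lengths exists iff this is <= 1
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a prefix code exists, e.g. {0, 10, 110, 111}
print(kraft_sum([1, 1, 2]))     # 1.25 -> no prefix code has these lengths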
Prefix Codes (Recap)
A simple property of codes guarantees unique decodeability
Prefix property
A codeword c ∈ {0, 1}+ is said to be a prefix of another codeword
c′ ∈ {0, 1}+ if there exists a string t ∈ {0, 1}+ such that c′ = ct.
Can you create c′ by gluing something to the end of c?
Example: 01101 has prefixes 0, 01, 011, 0110.
Prefix Codes
A code C = {c1, . . . , cI} is a prefix code if for every codeword ci ∈ C
there is no prefix of ci in C.
In a stream, no confusing one codeword with another
3 / 34
Prefix Codes as Trees (Recap)
C2 = {0, 10, 110, 111}
[Figure: the complete binary tree of codewords up to length 4 (the "total symbol code budget"), with the codewords of C2 = {0, 10, 110, 111} marked as leaves.]
4 / 34
This time
Bound on expected length for a prefix code
Shannon codes
Huffman coding
5 / 34
1 Expected Code Length
Minimising Expected Code Length
Shannon Coding
2 The Source Coding Theorem for Symbol Codes
3 Huffman Coding
Algorithm and Examples
Advantages and Disadvantages
6 / 34
Expected Code Length
With uniform codes, the length of a message of N outcomes is trivial to
compute
With variable-length codes, the length of a message of N outcomes will
depend on the outcomes we observe
Outcomes we observe have some uncertainty
On average, what length of message can we expect?
Expected Code Length
The expected length for a code C for ensemble X with A_X = {a_1, . . . , a_I} and P_X = {p_1, . . . , p_I} is
L(C, X) = E[ℓ(x)] = ∑_{x∈A_X} p(x) ℓ(x) = ∑_{i=1}^I p_i ℓ_i
8 / 34
Expected Code Length: Examples
Example: X has A_X = {a, b, c, d} and P = {1/2, 1/4, 1/8, 1/8}
1. The code C1 = {0001, 0010, 0100, 1000} has
L(C1, X) = ∑_{i=1}^4 p_i ℓ_i = 4
2. The code C2 = {0, 10, 110, 111} has
L(C2, X) = ∑_{i=1}^4 p_i ℓ_i = 1/2 × 1 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 1.75
9 / 34
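These numbers are easy to check mechanically. A minimal Python sketch (the dictionaries and helper name are illustrative, not from the lecture):

def expected_length(p, code):
    # L(C, X) = sum_i p_i * l_i, where l_i is the length of symbol i's codeword
    return sum(p[x] * len(code[x]) for x in p)

p  = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
C1 = {'a': '0001', 'b': '0010', 'c': '0100', 'd': '1000'}
C2 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(expected_length(p, C1))  # 4.0
print(expected_length(p, C2))  # 1.75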
Code Lengths and Probabilities
The Kraft inequality says that {ℓ_1, . . . , ℓ_I} are prefix code lengths iff
∑_{i=1}^I 2^{−ℓ_i} ≤ 1
If it were true that
∑_{i=1}^I 2^{−ℓ_i} = 1
then we could interpret
q = (2^{−ℓ_1}, . . . , 2^{−ℓ_I})
as a probability vector over I outcomes
General lengths ℓ?
10 / 34
Code Lengths and Probabilities
Probabilities from Code Lengths
Given code lengths ℓ = {ℓ_1, . . . , ℓ_I} such that ∑_{i=1}^I 2^{−ℓ_i} ≤ 1, we define q = {q_1, . . . , q_I}, the probabilities for ℓ, by
q_i = 2^{−ℓ_i} / z, where the normaliser z = ∑_i 2^{−ℓ_i} ensures that the q_i satisfy ∑_i q_i = 1.
Note: this implies ℓ_i = log₂(1/(z q_i))
Examples:
1. Lengths {1, 2, 2} give z = 1, so q_1 = 1/2, q_2 = 1/4, and q_3 = 1/4
2. Lengths {2, 2, 3} give z = 5/8, so q_1 = 2/5, q_2 = 2/5, and q_3 = 1/5
11 / 34
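The same normalisation in Python, as a small sketch (the function name is mine):

def probabilities_from_lengths(lengths):
    # q_i = 2^(-l_i) / z, with z = sum_i 2^(-l_i)
    z = sum(2 ** -l for l in lengths)
    return z, [2 ** -l / z for l in lengths]

print(probabilities_from_lengths([1, 2, 2]))  # (1.0, [0.5, 0.25, 0.25])
print(probabilities_from_lengths([2, 2, 3]))  # (0.625, [0.4, 0.4, 0.2])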
Minimising Expected Code Length
The probability view of lengths will be useful in answering:
Goal of compression
Given an ensemble X with probabilities PX = p = {p1, . . . , pI} how can
we minimise the expected code length?
In particular, we can relate the expected code length to the relative
entropy (KL divergence) between p,q:
Limits of compression
Given an ensemble X with probabilities p, and prefix code C with
codeword length probabilities q and normalisation z,
L(C, X) = H(X) + D_KL(p‖q) + log₂(1/z) ≥ H(X),
with equality only when ℓ_i = log₂(1/p_i).
12 / 34
Minimising Expected Code Length
Suppose we use code C with lengths ℓ = {ℓ_1, . . . , ℓ_I} and corresponding probabilities q = {q_1, . . . , q_I} with q_i = (1/z) 2^{−ℓ_i}. Then,
L(C, X) = ∑_i p_i ℓ_i
= ∑_i p_i log₂(1/(z q_i))
= ∑_i p_i log₂((1/p_i) · (p_i/q_i) · (1/z))
= ∑_i p_i [ log₂(1/p_i) + log₂(p_i/q_i) + log₂(1/z) ]
= ∑_i p_i log₂(1/p_i) + ∑_i p_i log₂(p_i/q_i) + log₂(1/z) ∑_i p_i
= H(X) + D_KL(p‖q) + log₂(1/z) · 1
13 / 34
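The identity can be checked numerically. A small sketch under the definitions above (the particular lengths are just an example; they satisfy the Kraft inequality with z < 1):

from math import log2

p = [0.5, 0.25, 0.125, 0.125]           # ensemble probabilities
lengths = [1, 2, 3, 4]                   # prefix-code lengths for this example
z = sum(2 ** -l for l in lengths)        # z = 0.9375
q = [2 ** -l / z for l in lengths]       # probabilities for the lengths

L = sum(pi * li for pi, li in zip(p, lengths))
H = sum(pi * log2(1 / pi) for pi in p)
D = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
print(L, H + D + log2(1 / z))            # both approximately 1.875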
Minimising Expected Code Length
So if q = {q1, . . . , qI} are the probabilities for the code lengths of C then
under ensemble X with probabilities p = {p1, . . . , pI}
L(C, X) = H(X) + D_KL(p‖q) + log₂(1/z)
Thus, L(C, X) is minimal (and equal to the entropy H(X)) if we can choose code lengths so that D_KL(p‖q) = 0 and log₂(1/z) = 0
But the relative entropy D_KL(p‖q) ≥ 0, with D_KL(p‖q) = 0 iff q = p (Gibbs' inequality)
Choosing ℓ_i = log₂(1/p_i) gives 2^{−ℓ_i} = p_i, so q = p and z = ∑_i 2^{−ℓ_i} = ∑_i p_i = 1, and hence log₂(1/z) = 0
14 / 34
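For the earlier example P = {1/2, 1/4, 1/8, 1/8}, the lengths ℓ_i = log₂(1/p_i) are integers and the resulting code attains the entropy. A small Python check (illustrative only):

from math import log2

p = [0.5, 0.25, 0.125, 0.125]
lengths = [int(log2(1 / pi)) for pi in p]   # here log2(1/p_i) is already an integer
L = sum(pi * li for pi, li in zip(p, lengths))
H = sum(pi * log2(1 / pi) for pi in p)
print(lengths, L, H)                        # [1, 2, 3, 3] 1.75 1.75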
Entropy as a Lower Bound on Expected Length
We have shown that for a code C with lengths corresponding to q,
L(C, X) ≥ H(X)
with equality only when C has code lengths ℓ_i = log₂(1/p_i)
Once again, the entropy determines a lower bound on how much
compression is possible
L(C,X ) refers to average compression
Individual message length could be bigger than the entropy
15 / 34
Shannon Codes
If we pick lengths ℓ_i = log₂(1/p_i), we get optimal expected code lengths
But log₂(1/p_i) is not always an integer—a problem for code lengths!
Shannon Code
Given an ensemble X with P_X = {p_1, . . . , p_I}, define codelengths ℓ = {ℓ_1, . . . , ℓ_I} by
ℓ_i = ⌈log₂(1/p_i)⌉ ≥ log₂(1/p_i).
A code C is called a Shannon code if it has codelengths ℓ.
Here ⌈x⌉ is the "smallest integer not smaller than x", e.g. ⌈2.1⌉ = 3, ⌈5⌉ = 5.
This gives us code lengths that are "closest" to log₂(1/p_i)
16 / 34
Shannon Codes: Examples
Examples:
1. If P_X = {1/2, 1/4, 1/4} then ℓ = {1, 2, 2}, so C = {0, 10, 11} is a Shannon code (in fact, this code has optimal length)
2. If P_X = {1/3, 1/3, 1/3} then ℓ = {2, 2, 2}, with Shannon code C = {00, 10, 11} (or C = {01, 10, 11}, . . . )
17 / 34
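Shannon code lengths are a one-liner to compute. A small sketch reproducing both examples (the function name is mine):

from math import ceil, log2

def shannon_lengths(p):
    # l_i = ceil(log2(1 / p_i))
    return [ceil(log2(1 / pi)) for pi in p]

print(shannon_lengths([1/2, 1/4, 1/4]))  # [1, 2, 2]
print(shannon_lengths([1/3, 1/3, 1/3]))  # [2, 2, 2]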
Source Coding Theorem for Symbol Codes
Shannon codes let us prove the following:
Source Coding Theorem for Symbol Codes
For any ensemble X , there exists a prefix code C such that
H(X) ≤ L(C,X) < H(X) + 1.
In particular, Shannon codes C — those with lengths ℓ_i = ⌈log₂(1/p_i)⌉ — have expected code length within 1 bit of the entropy.
Entropy also gives a guideline upper bound of compression
18 / 34
Shannon Codes
Since ⌈x⌉ is the smallest integer bigger than or equal to x, it must be the case that x ≤ ⌈x⌉ < x + 1. Therefore, if we create a Shannon code C for p = {p_1, . . . , p_I} with
ℓ_i = ⌈log₂(1/p_i)⌉ < log₂(1/p_i) + 1
it will satisfy
L(C,X) = ∑_i p_i ℓ_i < ∑_i p_i (log₂(1/p_i) + 1) = ∑_i p_i log₂(1/p_i) + ∑_i p_i = H(X) + 1
Furthermore, since ℓ_i ≥ −log₂ p_i we have 2^{−ℓ_i} ≤ 2^{log₂ p_i} = p_i, so
∑_i 2^{−ℓ_i} ≤ ∑_i p_i = 1.
By Kraft there is a prefix code with lengths ℓ_i.
Examples:
1. If P_X = {1/2, 1/4, 1/4} then ℓ = {1, 2, 2} and H(X) = 3/2 = L(C,X)
2. If P_X = {1/3, 1/3, 1/3} then ℓ = {2, 2, 2} and H(X) = log₂ 3 ≈ 1.58 ≤ L(C,X) = 2 ≤ 2.58 ≈ H(X) + 1
19 / 34
The Source Coding Theorem for Symbol Codes
The previous arguments have established:
Source Coding Theorem for Symbol Codes
For any ensemble X there exists a prefix code C such that
H(X) ≤ L(C,X) < H(X) + 1.
In particular, Shannon codes C — those with lengths ℓ_i = ⌈log₂(1/p_i)⌉ — have expected code length within 1 bit of the entropy.
This is good, but is it optimal?
21 / 34
Shannon codes are suboptimal
Example: Consider p_1 = 0.0001 and p_2 = 0.9999. (Note H(X) ≈ 0.0013)
The Shannon code C has lengths ℓ_1 = ⌈log₂ 10000⌉ = 14 and ℓ_2 = ⌈log₂(10000/9999)⌉ = 1
The expected length is L(C,X) = 14 × 0.0001 + 1 × 0.9999 = 1.0013
But clearly C′ = {0, 1} is a prefix code and L(C′,X) = 1
Shannon codes do not necessarily have smallest expected length
This is perhaps disappointing, as these codes were constructed very naturally from the theorem
Fortunately, there is another simple code that is provably optimal
22 / 34
Constructing a Huffman Code
Huffman Coding is a procedure for making provably optimal prefix codes
It assigns the longest codewords to the least probable symbols
Basic algorithm:
Take the two least probable symbols in the alphabet
Prepend bits 0 and 1 to the current codewords of these symbols
Combine these two symbols into a single "meta-symbol"
Repeat
24 / 34
Huffman Coding: Example 1
Start with A = {a, b, c} and P = {1/2, 1/4, 1/4}
[Figure: step 1 merges b and c (0.25 + 0.25 = 0.5), labelling the two branches 0 and 1; step 2 merges a (0.5) with the meta-symbol bc (0.5) into 1.0.]
Now we read off the labelling implied by the path from the last meta-symbol to each of the original symbols: C = {0, 10, 11}
25 / 34
Huffman Coding: Example 2
A_X = {a, b, c, d, e} and P_X = {0.25, 0.25, 0.2, 0.15, 0.15}
From Example 5.15 of MacKay. The Huffman algorithm builds the binary tree from its leaves: take the two least probable symbols (these will be given the longest codewords, which have equal length and differ only in the last digit), combine them into a single symbol, and repeat. Since each step reduces the size of the alphabet by one, the algorithm assigns strings to all symbols after |A_X| − 1 steps.
Merging steps: {d, e} → 0.3, then {b, c} → 0.45, then {a, de} → 0.55, then {ade, bc} → 1.0.
The codewords are then obtained by concatenating the binary digits in reverse order: C = {00, 10, 11, 010, 011}.
The codelengths selected by the Huffman algorithm (Table 5.5 of MacKay) are in some cases longer and in some cases shorter than the ideal codelengths, the Shannon information contents log₂(1/p_i):
a_i  p_i   log₂(1/p_i)  ℓ_i  c(a_i)
a    0.25  2.0          2    00
b    0.25  2.0          2    10
c    0.2   2.3          2    11
d    0.15  2.7          3    010
e    0.15  2.7          3    011
The expected length of the code is L = 2.30 bits, whereas the entropy is H = 2.2855 bits.
If at any point there is more than one way of selecting the two least probable symbols then the choice may be made in any manner – the expected length of the code will not depend on the choice.
Exercise 5.16 of MacKay: prove that there is no better symbol code for a source than the Huffman code.
C = {00, 10, 11, 010, 011}
26 / 34
Huffman Coding: Example 3
English letters – Monogram statistics
A Huffman code for the distribution over English letters and space (figure 2.1 of MacKay) has expected length 4.15 bits, while the entropy of the ensemble is 4.11 bits. Observe the disparities between the assigned codelengths ℓ_i and the ideal codelengths log₂(1/p_i):
a_i  p_i     log₂(1/p_i)  ℓ_i  c(a_i)
a    0.0575  4.1          4    0000
b    0.0128  6.3          6    001000
c    0.0263  5.2          5    00101
d    0.0285  5.1          5    10000
e    0.0913  3.5          4    1100
f    0.0173  5.9          6    111000
g    0.0133  6.2          6    001001
h    0.0313  5.0          5    10001
i    0.0599  4.1          4    1001
j    0.0006  10.7         10   1101000000
k    0.0084  6.9          7    1010000
l    0.0335  4.9          5    11101
m    0.0235  5.4          6    110101
n    0.0596  4.1          4    0001
o    0.0689  3.9          4    1011
p    0.0192  5.7          6    111001
q    0.0008  10.3         9    110100001
r    0.0508  4.3          5    11011
s    0.0567  4.1          4    0011
t    0.0706  3.8          4    1111
u    0.0334  4.9          5    10101
v    0.0069  7.2          8    11010001
w    0.0119  6.4          7    1101001
x    0.0073  7.1          7    1010001
y    0.0164  5.9          6    101001
z    0.0007  10.4         10   1101000001
–    0.1928  2.4          2    01
[Figure: the corresponding Huffman tree for this code.]
27 / 34
Huffman Coding: Formally
HUFFMAN(A, P):
1. If |A| = 2 return C = {0, 1}; else
2. Let a, a′ ∈ A be the least probable symbols.
3. Let A′ = A − {a, a′} ∪ {aa′}
4. Let P′ = P − {p_a, p_a′} ∪ {p_aa′} where p_aa′ = p_a + p_a′
5. Compute C′ = HUFFMAN(A′, P′)
6. Define C by
   c(a) = c′(aa′)0
   c(a′) = c′(aa′)1
   c(x) = c′(x) for x ∈ A′
7. Return C
28 / 34
Huffman Coding: Example 1
Start with A = {a, b, c} and P = {1/2, 1/4, 1/4}
HUFFMAN(A, P):
   b and c are least probable with p_b = p_c = 1/4
   A′ = {a, bc} and P′ = {1/2, 1/2}
   Call HUFFMAN(A′, P′):
      |A′| = |{a, bc}| = 2
      Return code with c′(a) = 0, c′(bc) = 1
   Define c(b) = c′(bc)0 = 10, c(c) = c′(bc)1 = 11, c(a) = c′(a) = 0
   Return C = {0, 10, 11}
The constructed code has L(C,X) = 1/2 × 1 + 1/4 × (2 + 2) = 1.5. The entropy is H(X) = 1.5.
29 / 34
Huffman Coding: Example 2
Start with A = {a, b, c, d, e} and P = {0.25, 0.25, 0.2, 0.15, 0.15}
HUFFMAN(A, P):
   A′ = {a, b, c, de} and P′ = {0.25, 0.25, 0.2, 0.3}
   Call HUFFMAN(A′, P′):
      A″ = {a, bc, de} and P″ = {0.25, 0.45, 0.3}
      Call HUFFMAN(A″, P″):
         A‴ = {ade, bc} and P‴ = {0.55, 0.45}
         Return c‴(ade) = 0, c‴(bc) = 1
      Return c″(a) = 00, c″(bc) = 1, c″(de) = 01
   Return c′(a) = 00, c′(b) = 10, c′(c) = 11, c′(de) = 01
Return c(a) = 00, c(b) = 10, c(c) = 11, c(d) = 010, c(e) = 011
The constructed code is C = {00, 10, 11, 010, 011}. It has
L(C,X) = 2 × (0.25 + 0.25 + 0.2) + 3 × (0.15 + 0.15) = 2.3. Note that H(X) ≈ 2.29.
30 / 34
Huffman Coding in Python
See full example code with examples at: https://gist.github.com/mreid/fdf6353ec39d050e972b

def huffman(p):
    '''Return a Huffman code for an ensemble with distribution p.'''
    assert sum(p.values()) == 1.0  # Ensure probabilities sum to 1

    # Base case of only two symbols: assign 0 or 1 arbitrarily
    if len(p) == 2:
        return dict(zip(p.keys(), ['0', '1']))

    # Create a new distribution by merging the lowest-probability pair
    p_prime = p.copy()
    a1, a2 = lowest_prob_pair(p)
    p1, p2 = p_prime.pop(a1), p_prime.pop(a2)
    p_prime[a1 + a2] = p1 + p2

    # Recurse and construct the code on the new distribution
    c = huffman(p_prime)
    ca1a2 = c.pop(a1 + a2)
    c[a1], c[a2] = ca1a2 + '0', ca1a2 + '1'

    return c
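A possible usage sketch for the function above. It relies on lowest_prob_pair from the linked gist; the stand-in below is my own assumption about that helper's behaviour (return the two least probable symbols):

def lowest_prob_pair(p):
    # Stand-in for the gist's helper: the two symbols with smallest probability
    a1, a2 = sorted(p, key=p.get)[:2]
    return a1, a2

p = {'a': 0.25, 'b': 0.25, 'c': 0.2, 'd': 0.15, 'e': 0.15}
code = huffman(p)
print(code)                                 # a prefix code with lengths {2, 2, 2, 3, 3}
print(sum(p[x] * len(code[x]) for x in p))  # expected length, approximately 2.3

The exact codewords depend on how ties between equally probable symbols are broken, but (as noted above from MacKay's Example 5.15) the expected length does not.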
31 / 34
Advantages of Huffman coding
Produces prefix codes automatically (by design)
Algorithm is simple and efficient
Huffman Codes are provably optimal [Exercise 5.16 (MacKay)]
If C_Huff is a Huffman code, then for any other uniquely decodable code C′,
L(C_Huff, X) ≤ L(C′, X)
It follows that H(X) ≤ L(C_Huff, X) < H(X) + 1
32 / 34
Disadvantages of Huffman coding
Assumes a fixed distribution of symbols
The extra bit in the SCT:
   If H(X) is large – not a problem
   If H(X) is small (e.g., ∼ 1 bit for English) codes can be up to 2× the optimal length
Huffman codes are the best possible symbol code, but symbol coding is not always the best type of code
Next Time: Stream Codes!
33 / 34
Summary
Key Concepts:
1. The expected code length L(C,X) = ∑_i p_i ℓ_i
2. Probabilities and codelengths are interchangeable: q_i = 2^{−ℓ_i} ⇐⇒ ℓ_i = log₂(1/q_i)
3. The relative entropy D_KL(p‖q) measures the excess bits over the entropy H(X) incurred by using the wrong code q for probabilities p
4. The Source Coding Theorem for symbol codes: there exists a prefix (Shannon) code C for ensemble X with ℓ_i = ⌈log₂(1/p_i)⌉ so that H(X) ≤ L(C,X) < H(X) + 1
5. Huffman codes are optimal symbol codes
Reading:
§5.3–5.7 of MacKay
§5.3–5.4, §5.6 & §5.8 of Cover & Thomas
34 / 34