COMP2610 – Information Theory Lecture 11: Entropy and Coding
Robert C. Williamson
Research School of Computer Science
27 August 2018
Brief Recap of Course (Last 6 Weeks)
How can we quantify information?
  Basic Definitions and Key Concepts
  Probability, Entropy & Information
How can we make good guesses?
  Probabilistic Inference
  Bayes Theorem
How much redundancy can we safely remove?
  Compression
  Source Coding Theorem
  Block, Huffman, and Lempel-Ziv coding
How much noise can we correct and how?
  Noisy-Channel Coding
  Repetition Codes, Hamming Codes
What is randomness?
  Kolmogorov Complexity
  Algorithmic Information Theory
Brief Overview of Course (Next 6 Weeks)
How can we quantify information?
  Basic Definitions and Key Concepts
  Probability, Entropy & Information
How can we make good guesses?
  Probabilistic Inference
  Bayes Theorem
How much redundancy can we safely remove?
  Compression
  Source Coding Theorem
  Block, Huffman, and Lempel-Ziv coding
How much noise can we correct and how?
  Noisy-Channel Coding
  Repetition Codes, Hamming Codes
What is randomness?
  Kolmogorov Complexity
  Algorithmic Information Theory
Basic goal of compression
Key concepts: codes and their types, raw bit content, essential bit content
Informal statement of source coding theorem
1 Introduction
  Overview
  What is Compression?
  A Communication Game
  What’s the best we can do?
2 Formalising Coding
  Entropy and Information: A Quick Review
  Defining Codes
3 Formalising Compression
  Reliability vs. Size
  Key Result: The Source Coding Theorem
What is Compression?
Cn y rd ths mssg wtht ny vwls?
It is not too difficult to read as there is redundancy in English text.
(Estimates of 1-1.5 bits per character, compared to log2 26 ≈ 4.7)
[Figure 2.3 from MacKay: conditional probability distributions for bigrams xy of English text. (a) P(y|x): each row shows the distribution of the second letter y given the first letter x. (b) P(x|y): each column shows the distribution of the first letter x given the second letter y. For example, if you see a "q", the next letter is very likely to be a "u"; the letter "e" is much more common than "j".]
Compression exploits differences in relative probability of symbols or blocks of symbols.
Compression in a Nutshell
Compression
Data compression is the process of replacing a message with a smaller message which can be reliably converted back to the original.
A General Communication Game
Imagine the following game between Sender & Receiver:
Sender & Receiver agree on code for each outcome ahead of time (e.g., 0 for Heads; 1 for Tails)
Sender observes outcomes, then codes and sends message
Receiver decodes message and recovers outcome sequence
Goal: Want small messages on average when outcomes are from a fixed, known, but uncertain source (e.g., coin flips with known bias)
Heads, Tails, Heads, … (the outcome sequence observed by the Sender and recovered by the Receiver)
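A minimal sketch of this game in Python (the code table and function names below are illustrative, not from the lecture), using the agreed 0/1 code for Heads/Tails:

```python
# A minimal sketch of the communication game, assuming the agreed code
# {Heads -> '0', Tails -> '1'} from the example above (names are illustrative).
CODE = {"Heads": "0", "Tails": "1"}
DECODE = {v: k for k, v in CODE.items()}

def sender(outcomes):
    """Code the observed outcomes into a binary message string."""
    return "".join(CODE[x] for x in outcomes)

def receiver(message):
    """Decode the binary message back into the outcome sequence."""
    return [DECODE[bit] for bit in message]

outcomes = ["Heads", "Tails", "Heads"]
message = sender(outcomes)                 # '010'
assert receiver(message) == outcomes
print(message, receiver(message))
```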
Sneak peek: source coding theorem
Consider a coin with P(Heads) = 0.9. If we want perfect transmission:
Coding single outcomes requires 1 bit/outcome
Coding 10 outcomes at a time needs 10 bits, or 1 bit/outcome
Not very interesting!
Things get interesting if we:
accept errors in transmission (this week)
allow variable length messages (next week)
Sneak peek: source coding theorem
If we are happy to fail on up to 2% of the sequences, we can ignore any sequence of 10 outcomes with more than 3 tails.
Why? The number of tails follows a Binomial(10, 0.1) distribution, so P(more than 3 tails) ≈ 1.3% < 2%.
There are only 176 < 2^8 sequences with 3 or fewer tails.
So, we can just code those, and ignore the rest!
Coding 10 outcomes with 2% failure is doable with 8 bits, or 0.8 bits/outcome.
Smallest bits/outcome needed for 10,000 outcome sequences?
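A quick sanity check of these numbers, as a sketch in Python (not part of the original slides): count the length-10 sequences with at most 3 tails, and the probability mass of the sequences we ignore.

```python
from math import comb

p_tail, N = 0.1, 10

# Sequences of length 10 with at most 3 tails: C(10,0) + ... + C(10,3) = 176 < 2**8
num_kept = sum(comb(N, k) for k in range(4))

# Probability of a sequence we ignore (more than 3 tails) under Binomial(10, 0.1)
p_fail = sum(comb(N, k) * p_tail**k * (1 - p_tail)**(N - k) for k in range(4, N + 1))

print(num_kept)            # 176
print(round(p_fail, 4))    # ~0.0128, i.e. we fail on roughly 1.3% < 2% of sequences
```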
Generalisation: Source Coding Theorem
What happens when we generalise to arbitrary error probability, and sequence size?
Source Coding Theorem (Informal Statement)
If: you want to uniformly code large sequences of outcomes with any degree of reliability from a random source
Then: the average number of bits per outcome you will need is roughly equal to the entropy of that source.
To define: “Uniformly code”, “large sequences”, “degree of reliability”, “average number of bits per outcome”, “roughly equal”
1 Introduction
  Overview
  What is Compression?
  A Communication Game
  What’s the best we can do?
2 Formalising Coding
  Entropy and Information: A Quick Review
  Defining Codes
3 Formalising Compression
  Reliability vs. Size
  Key Result: The Source Coding Theorem
Entropy and Information: Recap
An ensemble X is a triple (x, AX, PX); x is a random variable taking values in AX = {a1, a2, ..., aI} with probabilities PX = {p1, p2, ..., pI}.
Information
The information in the observation that x = ai (in the ensemble X) is
h(ai) = log2(1/pi) = −log2 pi
Entropy
The entropy of an ensemble X is the average information
H(X) = E[h(x)] = Σi pi h(ai) = Σi pi log2(1/pi)
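As a small sketch of these definitions (the probabilities below are chosen to match the colour example used later), information and entropy can be computed directly:

```python
from math import log2

# Example ensemble: A_X = {r, g, b} with P_X = {0.5, 0.25, 0.25}
P = {"r": 0.5, "g": 0.25, "b": 0.25}

def information(p):
    """h(a_i) = log2(1/p_i) = -log2(p_i)."""
    return -log2(p)

def entropy(P):
    """H(X) = sum_i p_i log2(1/p_i), the average information."""
    return sum(p * information(p) for p in P.values())

print({x: information(p) for x, p in P.items()})   # {'r': 1.0, 'g': 2.0, 'b': 2.0}
print(entropy(P))                                   # 1.5 bits
```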
What is a Code?
A source code is a process for assigning names to outcomes. The names
are typically expressed by strings of binary symbols. We will denote the set of all finite binary strings by
{0,1}+ = {0, 1, 00, 01, 10, ...}
Source Code
Given an ensemble X , the function c : AX → {0, 1}+ is a source code for X. The number of symbols in c(x) is the length l(x) of the codeword for x. The extension of c is defined by c(x1 ...xn) = c(x1)...c(xn)
The code c names outcomes from AX = {r, g, b} by
c(r) = 00, c(g) = 10, c(b) = 11
The length of the codeword for each outcome is 2. The extension of c gives c(rgrb) = 00100011
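A minimal sketch of this code and its extension (identifiers are illustrative):

```python
# The code c from the example: c(r) = 00, c(g) = 10, c(b) = 11
c = {"r": "00", "g": "10", "b": "11"}

def length(x):
    """l(x): number of symbols in the codeword c(x)."""
    return len(c[x])

def extension(xs):
    """The extension of c: c(x1 ... xn) = c(x1) ... c(xn)."""
    return "".join(c[x] for x in xs)

print(length("r"))          # 2
print(extension("rgrb"))    # 00100011
```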
Types of Codes
Let X be an ensemble and c : AX → {0,1}+ a code for X. We say c is a:
Uniform Code if l(x) is the same for all x ∈ AX
Variable-Length Code otherwise
Another important criterion for codes is whether the original symbol x can
be unambiguously determined given c(x). We say c is a:
Lossless Code if for all x1, x2 ∈ AX we have x1 ̸= x2 implies c(x1) ̸= c(x2)
Lossy Code otherwise
Types of Codes Examples
Examples: Let AX = {a, b, c, d}
1 c(a) = 00, c(b) = 01, c(c) = 10, c(d) = 11 is uniform lossless
2 c(a) = 0, c(b) = 10, c(c) = 110, c(d) = 111 is variable-length lossless
3 c(a) = 0, c(b) = 0, c(c) = 110, c(d) = 111 is variable-length lossy
4 c(a) = 00, c(b) = 00, c(c) = 10, c(d) = 11 is uniform lossy
5 c(a) = −, c(b) = −, c(c) = 10, c(d) = 11 is uniform lossy
A Note on Lossy Codes & Missing Codewords
When talking about a uniform lossy code c for AX = {a, b, c} we write
c(a) = 0, c(b) = 1, c(c) = -
where the symbol - means “no codeword”. This is shorthand for “the receiver will decode this codeword incorrectly”
For the purposes of these lectures, this is equivalent to the code
c(a) = 0, c(b) = 1, c(c) = 1
and the sender and receiver agreeing that the codeword 1 should always
be decoded as b
Assigning the outcome ai the missing codeword “-” just means “it is not possible to send ai reliably”
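Both classifications can be checked mechanically. The sketch below (helper names are my own; None stands for the missing codeword "-") tests whether a code is uniform and whether it is lossless, using the five examples above:

```python
def is_uniform(code):
    """Uniform: all assigned codewords have the same length."""
    lengths = {len(w) for w in code.values() if w is not None}
    return len(lengths) <= 1

def is_lossless(code):
    """Lossless: distinct outcomes get distinct codewords (and none are missing)."""
    words = list(code.values())
    return None not in words and len(set(words)) == len(words)

# The five examples for A_X = {a, b, c, d} above (None plays the role of "-"):
examples = [
    {"a": "00", "b": "01", "c": "10", "d": "11"},    # uniform, lossless
    {"a": "0", "b": "10", "c": "110", "d": "111"},   # variable-length, lossless
    {"a": "0", "b": "0", "c": "110", "d": "111"},    # variable-length, lossy
    {"a": "00", "b": "00", "c": "10", "d": "11"},    # uniform, lossy
    {"a": None, "b": None, "c": "10", "d": "11"},    # uniform, lossy
]
for code in examples:
    print(is_uniform(code), is_lossless(code))
```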
1 Introduction
  Overview
  What is Compression?
  A Communication Game
  What’s the best we can do?
2 Formalising Coding
  Entropy and Information: A Quick Review
  Defining Codes
3 Formalising Compression
  Reliability vs. Size
  Key Result: The Source Coding Theorem
Lossless Coding Example: AX = {r, g, b} with r twice as likely as b or g
pr = 0.5 and pg = pb = 0.25.
Suppose we use the following uniform lossless code: c(r) = 00; c(g) = 10; and c(b) = 11
For example, c(rrgbrbr) = 00001011001100 will have 14 bits.
On average, we will use l(r)pr + l(g)pg + l(b)pb = 2 bits per outcome
2N bits to code a sequence of N outcomes
Raw Bit Content
Uniform coding gives a crude measure of information: the number of bits required to assign equal length codes to each symbol
Raw Bit Content
If X is an ensemble with outcome set AX then its raw bit content is H0(X) = log2 |AX|.
Example: a uniform encoding of outcomes in AX = {a,b,c,d,e,f,g,h}:
Each outcome is encoded using H0(X) = 3 bits
The probabilities of the outcomes are ignored
Same as assuming a uniform distribution
For the purposes of compression, the exact codes don’t matter – just the number of bits used.
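A sketch of the raw bit content and one possible uniform encoding for the eight-outcome ensemble (the particular codeword assignment is illustrative; as noted, only the number of bits matters):

```python
from math import ceil, log2

outcomes = list("abcdefgh")

# Raw bit content: H0(X) = log2 |A_X|
H0 = log2(len(outcomes))                 # 3.0 bits

# One possible uniform encoding 000, 001, ..., 111 (probabilities are ignored)
bits = ceil(H0)
uniform_code = {x: format(i, f"0{bits}b") for i, x in enumerate(outcomes)}

print(H0)
print(uniform_code)    # {'a': '000', 'b': '001', ..., 'h': '111'}
```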
Lossy Coding Example: AX = {r, g, b} with pr = 0.5 and pg = pb = 0.25.
Using the uniform lossy code:
c(r) = 0; c(g) = −; and c(b) = 1
c(rrrrrrr) = 0000000; c(rrbbrbr) = 0011010; c(rrgbrbr) = −
What is the probability we can reliably code a sequence of N outcomes?
Given we can code a sequence of length N, how many bits are expected?
Lossy Coding Example: Colours
What is the probability we can reliably code a sequence of N outcomes?
P(x1 ... xN has no g) = P(x1 ≠ g) · · · P(xN ≠ g) = (1 − pg)^N
Given we can code a sequence of length N, how many bits are expected?
E[l(X1) + · · · + l(XN) | X1 ≠ g, ..., XN ≠ g] = Σn E[l(Xn) | Xn ≠ g] = N (l(r) pr + l(b) pb) / (1 − pg) = N = N log2 |{r, b}|
since l(r) = l(b) = 1 and pr + pb = 1 − pg.
c.f. 2N bits with the lossless code
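A sketch verifying both answers by simulation under the stated probabilities (helper names are assumptions; g is given no codeword, so any sequence containing g cannot be coded):

```python
import random

random.seed(0)
p = {"r": 0.5, "g": 0.25, "b": 0.25}
code = {"r": "0", "b": "1"}              # g has no codeword
N, trials = 10, 100_000

coded, total_bits = 0, 0
for _ in range(trials):
    seq = random.choices(list(p), weights=list(p.values()), k=N)
    if "g" not in seq:                   # the sequence can be coded reliably
        coded += 1
        total_bits += sum(len(code[x]) for x in seq)

print(coded / trials, (1 - p["g"]) ** N)   # both roughly 0.056 for N = 10
print(total_bits / coded)                  # N = 10 bits on average, c.f. 2N = 20 lossless
```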
Essential Bit Content
There is an inherent trade off between the number of bits required in a uniform lossy code and the probability of being able to code an outcome.
Smallest δ-sufficient subset
Let X be an ensemble and for 0 ≤ δ ≤ 1, define Sδ to be the smallest subset of AX such that
P(x ∈ Sδ) ≥ 1 − δ
For small δ, Sδ is the smallest collection of most likely outcomes.
If we uniformly code elements in Sδ, and ignore all others:
We can code a sequence of length N with probability (1 − δ)^N
If we can code a sequence, its expected length is N log2 |Sδ|
Essential Bit Content Example
Intuitively, construct Sδ by removing elements of X in ascending order of probability, till we have reached the 1 − δ threshold.
x:    a    b    c    d     e     f     g     h
P(x): 1/4  1/4  1/4  3/16  1/64  1/64  1/64  1/64
Outcomes ranked (high to low) by P(x = ai); the least likely are removed to make the set Sδ with P(x ∈ Sδ) ≥ 1 − δ
δ = 0    : Sδ = {a,b,c,d,e,f,g,h}
δ = 1/64 : Sδ = {a,b,c,d,e,f,g}
δ = 1/16 : Sδ = {a,b,c,d}
δ = 3/4  : Sδ = {a}
Essential Bit Content
The trade off between a probability δ of not being able to code an outcome and the size of the uniform code is captured by the essential bit content.
Essential Bit Content
For an ensemble X and 0 ≤ δ ≤ 1, the essential bit content of X is
Hδ(X) = log2 |Sδ|
Essential Bit Content
El o g 2 P x Trade off between a probability of δ of not coding an outcome and size of
!6 !4 !2à4 !2 uniform code is captured by the essential bit content
Sà Essential Bit Content
For an ensemble X and 0 ≤ δ ≤ 1, the essential bit content of X is TTT
Hδ(X) = log2 |Sδ|
x Pa(x) a 1/4 b 1/4 c 1/4 d 3/16 e 1/64 f 1/64 g 1/64 h 1/64
3 2.5 2 1.5 1 0.5
{a,b,c,d,e,f,g,h} {a,b,c,d,e,f,g} {a,b,c,d,e,f} {a,b,c,d,e}
0.3 0.4 0.5 0.6 0.7 0.8
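A sketch computing Sδ and Hδ(X) = log2 |Sδ| for the eight-outcome example above, reproducing the sets listed earlier (function names are my own):

```python
from fractions import Fraction as F
from math import log2

P = {"a": F(1, 4), "b": F(1, 4), "c": F(1, 4), "d": F(3, 16),
     "e": F(1, 64), "f": F(1, 64), "g": F(1, 64), "h": F(1, 64)}

def S_delta(P, delta):
    """Smallest delta-sufficient subset: take outcomes from most to least probable
    until their total probability reaches 1 - delta."""
    S, total = [], F(0)
    for x, prob in sorted(P.items(), key=lambda kv: -kv[1]):
        if total >= 1 - delta:
            break
        S.append(x)
        total += prob
    return S

for delta in [F(0), F(1, 64), F(1, 16), F(3, 4)]:
    S = S_delta(P, delta)
    print(delta, S, log2(len(S)))        # H_delta(X) = log2 |S_delta|
```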
The Source Coding Theorem for Uniform Codes (Theorem 4.1 in MacKay)
Our aim next time is to understand this:
The Source Coding Theorem for Uniform Codes
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for all N > N0
| (1/N) Hδ(X^N) − H | < ε.
The term (1/N) Hδ(X^N) is the average number of bits per outcome required to uniformly code all but a proportion δ of the length-N sequences.
Given a tiny probability of error δ, the average bits per symbol can be made as close to H as required.
Even if we allow a large probability of error, we cannot compress to fewer than (roughly) H bits per symbol.
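To see the theorem at work numerically, the sketch below (my own illustration, not from the slides) computes (1/N) Hδ(X^N) for the biased coin with P(Heads) = 0.9 and a few block lengths N; the values drift towards H(X) ≈ 0.47 bits as N grows.

```python
from math import comb, log2

p_heads = 0.9
H = -(p_heads * log2(p_heads) + (1 - p_heads) * log2(1 - p_heads))   # ~0.469 bits

def H_delta_per_outcome(N, delta):
    """(1/N) H_delta(X^N): rank length-N sequences by probability (fewer tails =
    more probable), keep the most probable until their mass reaches 1 - delta."""
    size, mass = 0, 0.0
    for k in range(N + 1):               # k = number of tails in the sequence
        if mass >= 1 - delta:
            break
        size += comb(N, k)
        mass += comb(N, k) * (1 - p_heads) ** k * p_heads ** (N - k)
    return log2(size) / N

print(round(H, 3))                       # 0.469
for N in (10, 100, 1000):
    print(N, round(H_delta_per_outcome(N, 0.02), 3))
```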
Some Intuition for the SCT
Don’t code individual symbols in an ensemble; rather, consider sequences of length N.
As the length of the sequence increases, the probability of seeing a "typical" sequence becomes much larger than that of seeing an "atypical" one.
Thus, we can get by with essentially assigning a unique codeword to each typical sequence
Recap: typical sets
Formal statement of source coding theorem
Proof of source coding theorem