
COMP2610 – Information Theory Lecture 11: Entropy and Coding
Robert C. Williamson

Research School of Computer Science
27 August 2018

Brief Recap of Course (Last 6 Weeks)
How can we quantify information?
- Basic Definitions and Key Concepts
- Probability, Entropy & Information
How can we make good guesses?
- Probabilistic Inference
- Bayes Theorem
How much redundancy can we safely remove?
- Compression
- Source Coding Theorem
- Block, Huffman, and Lempel-Ziv Coding
How much noise can we correct and how?
- Noisy-Channel Coding
- Repetition Codes, Hamming Codes
What is randomness?
- Kolmogorov Complexity
- Algorithmic Information Theory

Brief Overview of Course (Next 6 Weeks)
How can we quantify information?
- Basic Definitions and Key Concepts
- Probability, Entropy & Information
How can we make good guesses?
- Probabilistic Inference
- Bayes Theorem
How much redundancy can we safely remove?
- Compression
- Source Coding Theorem
- Block, Huffman, and Lempel-Ziv Coding
How much noise can we correct and how?
- Noisy-Channel Coding
- Repetition Codes, Hamming Codes
What is randomness?
- Kolmogorov Complexity
- Algorithmic Information Theory


- Basic goal of compression
- Key concepts: codes and their types, raw bit content, essential bit content
- Informal statement of the source coding theorem

1 Introduction
  - Overview
  - What is Compression?
  - A Communication Game
  - What's the best we can do?
2 Formalising Coding
  - Entropy and Information: A Quick Review
  - Defining Codes
3 Formalising Compression
  - Reliability vs. Size
  - Key Result: The Source Coding Theorem

What is Compression?
Cn y rd ths mssg wtht ny vwls?

It is not too difficult to read as there is redundancy in English text.
(Estimates of 1–1.5 bits per character, compared to log2 26 ≈ 4.7)
[Figure 2.3 from MacKay: conditional probability distributions over bigrams xy. (a) P(y | x): each row shows the conditional distribution of the second letter y given the first letter x. (b) P(x | y): each column shows the conditional distribution of the first letter x given the second letter y. If you see a "q", the next letter is very likely to be "u"; the space is also common after "q" because the source document makes heavy use of the word FAQ. The letter "e" is much more common than "j".]

Compression exploits differences in relative probability of symbols or blocks of symbols

Compression in a Nutshell
Compression
Data compression is the process of replacing a message with a smaller message which can be reliably converted back to the original.
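To make this concrete, here is a minimal illustration (not from the slides) of the "smaller message that can be reliably converted back" idea, using simple run-length encoding:

```python
from itertools import groupby

def rle_encode(msg: str) -> str:
    """Replace each run of repeated characters with <char><run length>."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(msg))

def rle_decode(code: str) -> str:
    """Invert the encoding: expand each <char><count> pair back into a run."""
    out, i = [], 0
    while i < len(code):
        ch = code[i]
        j = i + 1
        while j < len(code) and code[j].isdigit():
            j += 1
        out.append(ch * int(code[i + 1:j]))
        i = j
    return "".join(out)

msg = "HHHHHTTHHHHHHHTTT"
code = rle_encode(msg)          # 'H5T2H7T3' -- shorter than the original 17 symbols
assert rle_decode(code) == msg  # and reliably convertible back
```

This only helps for messages with long runs; for general sources (like the biased coin below) we need the more principled schemes this lecture builds toward.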

A General Communication Game
Imagine the following game between Sender & Receiver:
Sender & Receiver agree on code for each outcome ahead of time (e.g., 0 for Heads; 1 for Tails)
Sender observes outcomes, then codes and sends the message
Receiver decodes the message and recovers the outcome sequence
Goal: Want small messages on average when outcomes are from a fixed, known, but uncertain source (e.g., coin flips with known bias)
[Diagram: Sender observes "Heads, Tails, Heads, ..." and sends the coded message; Receiver decodes it and recovers "Heads, Tails, Heads, ..."]
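A minimal sketch of this game (illustrative only) with the agreed code 0 for Heads, 1 for Tails:

```python
# Sender and Receiver agree on the code ahead of time
code = {"Heads": "0", "Tails": "1"}
decode = {v: k for k, v in code.items()}

def sender(outcomes):
    """Code the observed outcomes and send one bit per outcome."""
    return "".join(code[o] for o in outcomes)

def receiver(message):
    """Recover the outcome sequence from the received bits."""
    return [decode[bit] for bit in message]

observed = ["Heads", "Tails", "Heads"]
message = sender(observed)              # '010'
assert receiver(message) == observed    # the receiver recovers the sequence exactly
```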

Sneak peek: source coding theorem
Consider a coin with P(Heads) = 0.9. If we want perfect transmission:
- Coding single outcomes requires 1 bit/outcome
- Coding 10 outcomes at a time needs 10 bits, or 1 bit/outcome
Not very interesting!
Things get interesting if we:
- accept errors in transmission (this week)
- allow variable length messages (next week)

Sneak peek: source coding theorem
If we are happy to fail on up to 2% of the sequences, we can ignore any sequence of 10 outcomes with more than 3 tails.
Why? The number of tails follows a Binomial(10, 0.1) distribution
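A quick numerical check of the claims that follow (a minimal Python sketch, not part of the original slides):

```python
from math import comb

p_tails, n = 0.1, 10

# P(more than 3 tails) under Binomial(10, 0.1)
p_fail = sum(comb(n, k) * p_tails**k * (1 - p_tails)**(n - k) for k in range(4, n + 1))

# Number of distinct length-10 sequences with at most 3 tails
n_sequences = sum(comb(n, k) for k in range(0, 4))

print(f"P(more than 3 tails) = {p_fail:.4f}")       # ~0.0128, below the 2% budget
print(f"sequences with <= 3 tails = {n_sequences}")  # 176, which is less than 2**8 = 256
```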
There are only 176 < 2^8 sequences with 3 or fewer tails. So, we can just code those, and ignore the rest!
Coding 10 outcomes with 2% failure is doable with 8 bits, or 0.8 bits/outcome.
Smallest bits/outcome needed for 10,000 outcome sequences?

Generalisation: Source Coding Theorem
What happens when we generalise to arbitrary error probability and sequence size?

Source Coding Theorem (Informal Statement)
If: you want to uniformly code large sequences of outcomes with any degree of reliability from a random source
Then: the average number of bits per outcome you will need is roughly equal to the entropy of that source.

To define: "uniformly code", "large sequences", "degree of reliability", "average number of bits per outcome", "roughly equal"

Entropy and Information: Recap
An ensemble X is a triple (x, AX, PX); x is a random variable taking values in AX = {a1, a2, ..., aI} with probabilities PX = {p1, p2, ..., pI}.

Information
The information in the observation that x = ai (in the ensemble X) is
  h(ai) = log2(1/pi) = -log2 pi

Entropy
The entropy of an ensemble X is the average information
  H(X) = E[h(x)] = Σi pi h(ai) = Σi pi log2(1/pi)

What is a Code?
A source code is a process for assigning names to outcomes. The names are typically expressed by strings of binary symbols. We will denote the set of all finite binary strings by
  {0,1}+ = {0, 1, 00, 01, 10, ...}

Source Code
Given an ensemble X, the function c : AX → {0,1}+ is a source code for X. The number of symbols in c(x) is the length l(x) of the codeword for x. The extension of c is defined by c(x1 ... xn) = c(x1) ... c(xn).

Example: the code c names outcomes from AX = {r, g, b} by c(r) = 00, c(g) = 10, c(b) = 11. The length of the codeword for each outcome is 2.
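A compact sketch of these definitions (illustrative Python, not from the slides), using the {r, g, b} code above; the last line anticipates the extension worked out next:

```python
import math

# Ensemble: outcomes and their probabilities (the colours example from the slides)
p = {"r": 0.5, "g": 0.25, "b": 0.25}

def information(px: float) -> float:
    """h(a_i) = -log2 p_i, the information in observing outcome a_i."""
    return -math.log2(px)

def entropy(p: dict) -> float:
    """H(X) = sum_i p_i log2(1/p_i), the average information of the ensemble."""
    return sum(pi * information(pi) for pi in p.values())

# A uniform source code c : A_X -> {0,1}+
c = {"r": "00", "g": "10", "b": "11"}

def extension(code: dict, outcomes: str) -> str:
    """c(x1 ... xn) = c(x1) ... c(xn)."""
    return "".join(code[x] for x in outcomes)

print(entropy(p))                          # 1.5 bits per outcome
print({x: len(w) for x, w in c.items()})   # every codeword has length l(x) = 2
print(extension(c, "rgrb"))                # '00100011'
```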
The extension of c gives c(rgrb) = 00100011.

Types of Codes
Let X be an ensemble and c : AX → {0,1}+ a code for X. We say c is a:
- Uniform Code if l(x) is the same for all x ∈ AX
- Variable-Length Code otherwise
Another important criterion for codes is whether the original symbol x can be unambiguously determined given c(x). We say c is a:
- Lossless Code if for all x1, x2 ∈ AX we have x1 ≠ x2 implies c(x1) ≠ c(x2)
- Lossy Code otherwise

Types of Codes: Examples
Let AX = {a, b, c, d}
1. c(a) = 00, c(b) = 01, c(c) = 10, c(d) = 11 is uniform lossless
2. c(a) = 0, c(b) = 10, c(c) = 110, c(d) = 111 is variable-length lossless
3. c(a) = 0, c(b) = 0, c(c) = 110, c(d) = 111 is variable-length lossy
4. c(a) = 00, c(b) = 00, c(c) = 10, c(d) = 11 is uniform lossy
5. c(a) = −, c(b) = −, c(c) = 10, c(d) = 11 is uniform lossy

A Note on Lossy Codes & Missing Codewords
When talking about a uniform lossy code c for AX = {a, b, c} we write
  c(a) = 0, c(b) = 1, c(c) = −
where the symbol − means "no codeword". This is shorthand for "the receiver will decode this codeword incorrectly".
For the purposes of these lectures, this is equivalent to the code
  c(a) = 0, c(b) = 1, c(c) = 1
with the sender and receiver agreeing that the codeword 1 should always be decoded as b.
Assigning the outcome ai the missing codeword "−" just means "it is not possible to send ai reliably".

Lossless Coding
Example: AX = {r, g, b} with r twice as likely as b or g: pr = 0.5 and pg = pb = 0.25.
Suppose we use the following uniform lossless code: c(r) = 00, c(g) = 10, c(b) = 11.
For example, c(rrgbrbr) = 00001011001100 will have 14 bits. On average, we will use
- l(r) pr + l(g) pg + l(b) pb = 2 bits per outcome
- 2N bits to code a sequence of N outcomes

Raw Bit Content
Uniform coding gives a crude measure of information: the number of bits required to assign equal length codes to each symbol.
Raw Bit Content: if X is an ensemble with outcome set AX then its raw bit content is H0(X) = log2 |AX|.
This is a uniform encoding of outcomes in AX = {a, b, c, d, e, f, g, h}:
- Each outcome is encoded using H0(X) = 3 bits
- The probabilities of the outcomes are ignored (same as assuming a uniform distribution)
For the purposes of compression, the exact codes don't matter – just the number of bits used.

Lossy Coding
Example: AX = {r, g, b}, pr = 0.5 and pg = pb = 0.25. Using the uniform lossy code c(r) = 0, c(g) = −, c(b) = 1:
  c(rrrrrrr) = 0000000; c(rrbbrbr) = 0011010; c(rrgbrbr) = −
Two questions:
- What is the probability we can reliably code a sequence of N outcomes?
- Given we can code a sequence of length N, how many bits are expected?

Lossy Coding Example: Colours
What is the probability we can reliably code a sequence of N outcomes?
  P(x1 ... xN has no g) = P(x1 ≠ g) ... P(xN ≠ g) = (1 − pg)^N
Given we can code a sequence of length N, how many bits are expected?
  E[l(X1) + ··· + l(XN) | X1 ≠ g, ..., XN ≠ g] = N E[l(Xn) | Xn ≠ g] = N (l(r) pr + l(b) pb) / (1 − pg) = N
since l(r) = l(b) = 1 and pr + pb = 1 − pg.
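As a sanity check on both answers (a small simulation sketch, not from the slides, with pr = 0.5, pg = pb = 0.25):

```python
import random

random.seed(0)
p = {"r": 0.5, "g": 0.25, "b": 0.25}
length = {"r": 1, "b": 1}        # lossy code: c(r) = 0, c(b) = 1, no codeword for g
N, trials = 10, 100_000

codable, total_bits = 0, 0
for _ in range(trials):
    seq = random.choices(list(p), weights=list(p.values()), k=N)
    if "g" not in seq:           # the sequence can be coded reliably
        codable += 1
        total_bits += sum(length[x] for x in seq)

print(codable / trials)          # ≈ (1 - pg)^N = 0.75**10 ≈ 0.0563
print(total_bits / codable)      # ≈ N = 10 bits, c.f. 2N = 20 with the lossless code
```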
In other words, a codable sequence of length N needs N log2 |{r, b}| = N bits, c.f. 2N bits with the lossless code.

Essential Bit Content
There is an inherent trade-off between the number of bits required in a uniform lossy code and the probability of being able to code an outcome.
Smallest δ-sufficient subset
Let X be an ensemble and, for 0 ≤ δ ≤ 1, define Sδ to be the smallest subset of AX such that
  P(x ∈ Sδ) ≥ 1 − δ
For small δ, Sδ is the smallest collection of the most likely outcomes.
If we uniformly code elements in Sδ, and ignore all others:
- We can code a sequence of length N with probability (1 − δ)^N
- If we can code a sequence, its expected length is N log2 |Sδ|

Essential Bit Content Example
Intuitively, construct Sδ by removing elements of X in ascending order of probability, until we have reached the 1 − δ threshold.

  x    P(x)
  a    1/4
  b    1/4
  c    1/4
  d    3/16
  e    1/64
  f    1/64
  g    1/64
  h    1/64

Outcomes are ranked (high to low) by P(x = ai) and the least probable are removed to form Sδ with P(x ∈ Sδ) ≥ 1 − δ:
- δ = 0:    Sδ = {a, b, c, d, e, f, g, h}
- δ = 1/64: Sδ = {a, b, c, d, e, f, g}
- δ = 1/16: Sδ = {a, b, c, d}
- δ = 3/4:  Sδ = {a}

Essential Bit Content
The trade-off between a probability δ of not coding an outcome and the size of the uniform code is captured by the essential bit content.
Essential Bit Content
For an ensemble X and 0 ≤ δ ≤ 1, the essential bit content of X is
  Hδ(X) = log2 |Sδ|
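A small sketch (not from the slides) that builds Sδ greedily and computes Hδ(X) for the example ensemble above:

```python
import math

# Example ensemble from the slides
p = {"a": 1/4, "b": 1/4, "c": 1/4, "d": 3/16,
     "e": 1/64, "f": 1/64, "g": 1/64, "h": 1/64}

def smallest_sufficient_subset(p: dict, delta: float) -> set:
    """Smallest S_delta with P(x in S_delta) >= 1 - delta:
    add outcomes from most to least probable until the threshold is reached."""
    s, mass = set(), 0.0
    for x, px in sorted(p.items(), key=lambda kv: -kv[1]):
        if mass >= 1 - delta:
            break
        s.add(x)
        mass += px
    return s

def essential_bit_content(p: dict, delta: float) -> float:
    """H_delta(X) = log2 |S_delta|."""
    return math.log2(len(smallest_sufficient_subset(p, delta)))

for delta in [0, 1/64, 1/16, 3/4]:
    s = smallest_sufficient_subset(p, delta)
    print(delta, sorted(s), essential_bit_content(p, delta))
# delta = 0    -> all 8 outcomes, H_0(X) = 3 bits
# delta = 1/64 -> 7 outcomes,     H_delta(X) = log2 7 ≈ 2.81 bits
# delta = 1/16 -> 4 outcomes,     H_delta(X) = 2 bits
# delta = 3/4  -> {a},            H_delta(X) = 0 bits
```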
[Figure (MacKay, Fig. 4.1): the outcomes of X on a −log2 P(x) scale, and the essential bit content Hδ(X) as a function of δ, with Sδ shrinking from {a, b, c, d, e, f, g, h} toward {a} as δ increases.]

The Source Coding Theorem for Uniform Codes (Theorem 4.1 in MacKay)
Our aim next time is to understand this:
The Source Coding Theorem for Uniform Codes
Let X be an ensemble with entropy H = H(X) bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for all N > N0
  | (1/N) Hδ(X^N) − H | < ε.
The term (1/N) Hδ(X^N) is the average number of bits required to uniformly code all but a proportion δ of the symbols:
- Given a tiny probability of error δ, the average bits per symbol can be made as close to H as required.
- Even if we allow a large probability of error, we cannot compress to fewer than H bits per symbol.

Some Intuition for the SCT
Don't code individual symbols in an ensemble; rather, consider sequences of length N.
As the length of the sequence increases, the probability of seeing a "typical" sequence becomes much larger than that of the "atypical" sequences.
Thus, we can get by with essentially assigning a unique codeword to each typical sequence.

Next time:
- Recap: typical sets
- Formal statement of the source coding theorem
- Proof of the source coding theorem
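As a preview of that formal statement, here is a rough numerical check (a sketch, not from the slides): for the biased coin with P(Heads) = 0.9, the per-outcome essential bit content (1/N) Hδ(X^N) creeps toward H(X) ≈ 0.469 bits as N grows. The sketch exploits the fact that all sequences with the same number of tails are equally likely.

```python
import math
from math import comb

def bits_per_outcome(N: int, p_tails: float = 0.1, delta: float = 0.02) -> float:
    """(1/N) H_delta(X^N) for N flips of a coin with P(Tails) = p_tails.
    Sequences with k tails share one probability, decreasing in k, so S_delta
    is built by taking sequences with 0 tails, 1 tail, ... until the total
    probability reaches 1 - delta (only part of the last level is needed)."""
    size, mass = 0, 0.0
    for k in range(N + 1):
        p_seq = (1 - p_tails) ** (N - k) * p_tails ** k   # prob. of one sequence with k tails
        if mass + comb(N, k) * p_seq >= 1 - delta:
            size += math.ceil((1 - delta - mass) / p_seq)  # just enough k-tail sequences
            break
        size += comb(N, k)
        mass += comb(N, k) * p_seq
    return math.log2(size) / N

H = -(0.9 * math.log2(0.9) + 0.1 * math.log2(0.1))   # entropy ≈ 0.469 bits
for N in [10, 100, 1000]:
    print(N, round(bits_per_outcome(N), 3))   # decreases toward H as N grows
print("H(X) =", round(H, 3))
```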