COMP2610/6261 – Information Theory – Lecture 20: Joint-Typicality and the Noisy-Channel Coding Theorem
Robert C. Williamson
Research School of Computer Science
October 24th, 2018
Channel Capacity: Recap
The largest possible reduction in uncertainty achievable across a channel
is its capacity
Channel Capacity
The capacity C of a channel Q is the largest mutual information between its input and output over all choices of input ensemble. That is,
C = max_{p_X} I(X;Y)
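As a quick illustration (a sketch, not from the slides), the following code estimates C numerically by brute-force search over binary input distributions; the channel matrix, the flip probability f = 0.15, the grid size, and all names are illustrative choices.

import numpy as np

def mutual_information(pX, Q):
    """I(X;Y) in bits; Q[y, x] = P(y|x), pX[x] = P(x)."""
    pXY = Q * pX                          # joint P(x, y), indexed [y, x]
    pY = pXY.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(pXY > 0, pXY / np.outer(pY, pX), 1.0)
    return float(np.sum(pXY * np.log2(ratio)))

# Binary symmetric channel with flip probability f = 0.15 (an example value).
f = 0.15
Q = np.array([[1 - f, f],
              [f, 1 - f]])                # rows: y = 0, 1; columns: x = 0, 1

ps = np.linspace(0.0, 1.0, 1001)          # grid over input distributions pX = (1-p, p)
capacities = [mutual_information(np.array([1 - p, p]), Q) for p in ps]
C = max(capacities)
print(f"C ≈ {C:.4f} bits, achieved near p = {ps[int(np.argmax(capacities))]:.2f}")
# For the BSC this should agree with the closed form C = 1 - H2(f) ≈ 0.3902.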
Block Codes: Recap
(N, K) Block Code
Given a channel Q with inputs X and outputs Y, an integer N > 0, and K > 0, an (N, K) block code for Q is a list of S = 2^K codewords
C = {x^(1), x^(2), . . . , x^(2^K)}
where each x^(s) ∈ X^N consists of N symbols from X.
The rate of a block code is K/N = (log2 S)/N
Reliability: Recap
[Diagram: s_in → Encoder → x → channel Q → y → Decoder → s_out]
Probability of (Block) Error
Given a channel Q, the probability of (block) error for a code is
pB = P(s_out ≠ s_in) = ∑_{s_in} P(s_out ≠ s_in | s_in) P(s_in)
and its maximum probability of (block) error is
pBM = max_{s_in} P(s_out ≠ s_in | s_in)
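To make the definition concrete, here is a small Monte Carlo sketch (an illustration, not from the slides) estimating pB for the repetition code R3 over a BSC with f = 0.1, assuming the two source messages are equiprobable and majority-vote decoding.

import numpy as np

rng = np.random.default_rng(0)
f, trials = 0.1, 100_000

s_in = rng.integers(0, 2, size=trials)                # equiprobable source bits
x = np.repeat(s_in[:, None], 3, axis=1)               # encode: 0 -> 000, 1 -> 111
flips = (rng.random((trials, 3)) < f).astype(int)     # BSC flips each bit w.p. f
y = x ^ flips
s_out = (y.sum(axis=1) >= 2).astype(int)              # majority-vote decoder
pB = np.mean(s_out != s_in)
print(f"estimated pB ≈ {pB:.4f}   (exact: 3f^2(1-f) + f^3 = {3*f**2*(1-f) + f**3:.4f})")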
The Noisy-Channel Coding Theorem: Recap
Informal Statement
Recall that a rate R is achievable if there is a block code with this rate and
arbitrarily small error probability
We highlighted the following remarkable result:
Noisy-Channel Coding Theorem (Informal)
If Q is a channel with capacity C then the rate R is achievable if and only
if R ≤ C, that is, the rate is no greater than the channel capacity.
Ideally, we would like to know:
Can we go above C if we allow some fixed probability of error?
Is there a maximal rate for a fixed probability of error?
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
The Noisy-Channel Coding Theorem
Formal Statement
Recall: a rate R is achievable if for any tolerance ε > 0, an (N,K) code with rate K/N ≥ R exists with max. block error pBM < ε
The Noisy-Channel Coding Theorem (Formal)
1. Any rate R < C is achievable for Q
2. If a probability of bit error pb := pB/K is acceptable, (N,K) codes exist with rates
K/N ≤ R(pb) = C / (1 − H2(pb))
3. For any pb, we cannot achieve a rate greater than R(pb) with probability of bit error pb.
Note that as pb → 1/2, R(pb) → +∞, while as pb → 0 or 1, R(pb) → C; so we cannot achieve a rate greater than C with arbitrarily small probability of bit error
Implications of NCCT
Suppose we know a channel has capacity 0.6 bits
We cannot achieve a rate of 0.8 with arbitrarily small error
We can achieve a rate of 0.8 with probability of bit error 5%, since 0.6 / (1 − H2(0.05)) = 0.8408 > 0.8
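A quick numeric check of the figures on this slide (a sketch, base-2 logs throughout):

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

C, pb = 0.6, 0.05
print(f"H2(0.05) = {H2(pb):.4f},  R(0.05) = C/(1 - H2(0.05)) = {C/(1 - H2(pb)):.4f}")
# Expect roughly 0.2864 and 0.8408.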
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
Joint Typicality
Recall that a random variable z from Z^N is typical for an ensemble Z whenever its average symbol information is within β of the entropy H(Z):
| −(1/N) log2 P(z) − H(Z) | < β
Joint Typicality
A pair of sequences x ∈ A_X^N and y ∈ A_Y^N, each of length N, are jointly typical (to tolerance β) for distribution P(x, y) if
1. x is typical of P(x) [z = x above]
2. y is typical of P(y) [z = y above]
3. (x, y) is typical of P(x, y) [z = (x, y) above]
The jointly typical set of all such pairs is denoted J_Nβ.
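The definition translates almost directly into code. The sketch below is a generic checker (function and variable names are mine, not the slides'); the example on the next slide supplies concrete sequences and entropies.

import numpy as np

def avg_symbol_info(seq, prob):
    """-(1/N) log2 P(seq) for an i.i.d. sequence, given per-symbol probabilities prob[s]."""
    return -float(np.mean([np.log2(prob[s]) for s in seq]))

def is_jointly_typical(x, y, pX, pY, pXY, H_X, H_Y, H_XY, beta):
    """The three conditions of the definition, each checked to tolerance beta."""
    pair_info = -float(np.mean([np.log2(pXY[(a, b)]) for a, b in zip(x, y)]))
    return (abs(avg_symbol_info(x, pX) - H_X) < beta
            and abs(avg_symbol_info(y, pY) - H_Y) < beta
            and abs(pair_info - H_XY) < beta)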
Joint Typicality
Example
Example (pX = (0.9, 0.1) and BSC with f = 0.2): a jointly typical pair of length N = 100:
x 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
y 0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111
Here:
x has 10 1's (cf. P(X = 1) = 0.1)
y has 26 1's (cf. P(Y = 1) = (0.8)(0.1) + (0.2)(0.9) = 0.26)
x, y differ in 20 bits (cf. P(X ≠ Y) = 0.2); this is essential in addition to the above two facts
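The quoted counts can be checked directly from the pair shown above. The sketch below rebuilds the two 100-bit strings and recomputes them; x's typicality check is shown, and y and the pair are analogous.

from math import log2

# Rebuild the slide's jointly typical pair: pX = (0.9, 0.1), BSC f = 0.2, N = 100.
x = "1"*10 + "0"*90
y = "00" + "1"*8 + "0"*72 + "1"*18
N = 100

print("ones in x:", x.count("1"))                     # 10, cf. P(X=1) = 0.1
print("ones in y:", y.count("1"))                     # 26, cf. P(Y=1) = 0.26
print("bits where x and y differ:",
      sum(a != b for a, b in zip(x, y)))               # 20, cf. P(X != Y) = 0.2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

# Typicality of x: average symbol information versus H(X) = H2(0.1).
info_x = -(x.count("1")*log2(0.1) + x.count("0")*log2(0.9)) / N
print(f"-1/N log2 P(x) = {info_x:.3f}   vs   H(X) = {H2(0.1):.3f}")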
Joint Typicality
Counts
[Figure 10.2 (MacKay): the jointly typical set. The horizontal direction represents A_X^N, the set of all input strings of length N; the vertical direction represents A_Y^N, the set of all output strings of length N. The outer box contains all conceivable input–output pairs; each dot is a jointly typical pair (x, y), and there are about 2^{NH(X,Y)} such dots.]
There are approximately:
2^{NH(X)} typical x ∈ A_X^N
2^{NH(Y)} typical y ∈ A_Y^N
2^{NH(X,Y)} typical pairs (x, y) ∈ A_X^N × A_Y^N
2^{NH(Y|X)} typical y given x
Thus, by selecting independent typical vectors, we arrive at a jointly typical pair with probability approximately
2^{NH(X,Y)} / (2^{NH(X)} · 2^{NH(Y)}) = 2^{−NI(X;Y)}
Here we used I(X;Y) = H(X) + H(Y) − H(X,Y)
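The running example makes this counting argument numeric (a sketch; N = 100 as on the previous slide, all names illustrative):

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

HX = H2(0.1)                        # input entropy
HY = H2(0.8*0.1 + 0.2*0.9)          # P(Y=1) = 0.26
HXY = HX + H2(0.2)                  # H(X,Y) = H(X) + H(Y|X), with H(Y|X) = H2(f)
I = HX + HY - HXY
N = 100
print(f"H(X)={HX:.3f}  H(Y)={HY:.3f}  H(X,Y)={HXY:.3f}  I(X;Y)={I:.3f}")
print(f"2^(N*H(X,Y)) / (2^(N*H(X)) * 2^(N*H(Y))) = 2^(-N*I) = 2^({-N*I:.1f})")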
Joint Typicality Theorem
Let x, y be drawn from (XY)^N with P(x, y) = ∏_n P(x_n, y_n).
Joint Typicality Theorem
For all tolerances β > 0:
1. Almost every pair is eventually jointly typical: P((x, y) ∈ J_Nβ) → 1 as N → ∞
2. The number of jointly typical sequences is roughly 2^{NH(X,Y)}: |J_Nβ| ≤ 2^{N(H(X,Y)+β)}
3. For x′ and y′ drawn independently from the marginals of P(x, y), P((x′, y′) ∈ J_Nβ) ≤ 2^{−N(I(X;Y)−3β)}
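Plugging the running example into the theorem's bounds (a sketch; the block length and tolerance are my choices). Note that part 3 is only informative when β < I(X;Y)/3, which is why a small tolerance and a large N are used here.

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

HX, HY = H2(0.1), H2(0.26)          # pX = (0.9, 0.1) into a BSC with f = 0.2
HXY = H2(0.1) + H2(0.2)
I = HX + HY - HXY                   # ≈ 0.105 bits
N, beta = 1000, 0.02                # beta < I/3, so part 3 gives a non-trivial bound

print(f"|J_Nbeta| <= 2^(N(H(X,Y)+beta)) = 2^({N*(HXY + beta):.0f})")
print(f"P((x',y') in J_Nbeta) <= 2^(-N(I - 3*beta)) = 2^({-N*(I - 3*beta):.1f})")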
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
The Noisy-Channel Coding Theorem
Let Q be a channel with inputs A_X and outputs A_Y.
Let C = max_{pX} I(X;Y) be the capacity of Q and H2(p) = −p log2 p − (1 − p) log2(1 − p).
[Figure 10.1 (MacKay): the portion of the (R, pb) plane proved achievable (regions 1 and 2) and not achievable (region 3).]
The Noisy-Channel Coding Theorem
1. Any rate R < C is achievable for Q (i.e., for any tolerance ε > 0, an (N,K) code with rate K/N ≥ R exists with max. block error pBM < ε)
2. If a probability of bit error pb := pB/K is acceptable, there exist (N,K) codes with rates
K/N ≤ R(pb) = C / (1 − H2(pb))
3. For any pb, rates greater than R(pb) are not achievable.
Some Intuition for the NCCT
The proof of the NCCT is based on the following observations:
Each choice of input distribution pX induces an output distribution pY
There are 2^{NH(Y)} typical y (i.e., with information content per symbol ≈ H(Y))
For each x there are 2^{NH(Y|X)} typical y for that x
There are at most 2^{NH(Y)} / 2^{NH(Y|X)} = 2^{N(H(Y)−H(Y|X))} = 2^{NI(X;Y)} inputs x with disjoint sets of typical y. Coding with these x minimises error
The best rate K/N is achieved when the number of such x (i.e., 2^K) is maximised: 2^K ≤ max_{pX} 2^{NI(X;Y)} = 2^{N max_{pX} I(X;Y)} = 2^{NC}
[Figure (MacKay 9.7/9.8): extended channels obtained from a Z channel with transition probability 0.15 for N = 1, 2, 4, with typical output sets for typical inputs and a non-overlapping subset of them, as for the noisy typewriter.]
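A quick numeric version of this counting bound for a concrete channel (a sketch; the BSC with f = 0.1 and uniform inputs is my choice, not the slides'):

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

f, N = 0.1, 100
I = 1 - H2(f)          # H(Y) - H(Y|X) for a BSC with uniform input
print(f"at most about 2^({N*I:.0f}) non-confusable inputs, i.e. K <= about {N*I:.0f} bits")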
Proof Sketch of NCCT Part 1
We can:
define a family of random codes, which rely on joint typicality, and
which achieve the given rate
show that on average, such a code has a low probability of block error
deduce that at least one such code must have a low probability of
block error
“expurgate” the above code so that it has low maximal probability of
error
This will establish that the final code achieves low maximal probability of
error, while achieving the given rate!
Random Coding and Typical Set Decoding
Make random code C with rate R′:
Fix pX and choose S = 2^{NR′} codewords, x^(1), . . . , x^(S), each drawn with P(x) = ∏_n P(x_n)
Decode y via typical sets:
If there is exactly one ŝ such that (x^(ŝ), y) are jointly typical, then decode y as ŝ
Otherwise, fail (ŝ = 0)
Errors:
pB(C) = P(ŝ ≠ s | C)
⟨pB⟩ = ∑_C P(ŝ ≠ s | C) P(C)
pBM(C) = max_s P(ŝ ≠ s | s, C)
(Aim: ∃ C s.t. pBM(C) is small)
[Figure 10.4: (a) A random code with codewords x^(1), . . . , x^(4). (b) Example decodings by the typical-set decoder. A sequence that is not jointly typical with any of the codewords, such as y_a, is decoded as ŝ = 0. A sequence that is jointly typical with codeword x^(3) alone, y_b, is decoded as ŝ = 3. Similarly, y_c is decoded as ŝ = 4. A sequence that is jointly typical with more than one codeword, such as y_d, is decoded as ŝ = 0.]
1. We fix P(x) and generate the S = 2^{NR′} codewords of an (N,K) code C at random according to
P(x) = ∏_{n=1}^{N} P(x_n).   (10.11)
A random code is shown schematically in figure 10.4a.
2. The code is known to both sender and receiver.
3. A message s is chosen from {1, 2, . . . , 2^{NR′}}, and x^(s) is transmitted. The received signal is y, with
P(y | x^(s)) = ∏_{n=1}^{N} P(y_n | x^(s)_n).   (10.12)
4. The signal is decoded by typical-set decoding.
Typical-set decoding. Decode y as ŝ if (x^(ŝ), y) are jointly typical and there is no other s′ such that (x^(s′), y) are jointly typical; otherwise declare a failure (ŝ = 0).
This is not the optimal decoding algorithm, but it will be good enough, and easier to analyze. The typical-set decoder is illustrated in figure 10.4b.
5. A decoding error occurs if ŝ ≠ s.
There are three probabilities of error that we can distinguish. First, there is the probability of block error for a particular code C, that is,
pB(C) ≜ P(ŝ ≠ s | C).   (10.13)
This is a difficult quantity to evaluate for any given code.
Second, there is the average over all codes of this block error probability,
⟨pB⟩ ≜ ∑_C P(ŝ ≠ s | C) P(C).   (10.14)
18 / 34
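The random-coding and typical-set-decoding scheme on this slide can be simulated directly. The following is a minimal sketch under assumed toy parameters (a binary symmetric channel with flip probability f, uniform input ensemble, small blocklength N and typicality slack beta); it is illustrative only, and at such a small N the block error is still far from zero — the theorem is a statement about N → ∞.

```python
from math import log2
import random

f, N, beta = 0.1, 25, 0.2            # illustrative channel and typicality slack
Rprime = 0.25                        # code rate R'
S = 2 ** round(N * Rprime)           # number of codewords (here 2^6 = 64)
rng = random.Random(0)

def H2(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

HXY = 1.0 + H2(f)                    # joint entropy for a uniform-input BSC

# Random code: each codeword drawn i.i.d. uniform, i.e. P(x) = prod_n P(x_n)
code = [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(S)]

def channel(x):                      # pass x through the BSC
    return tuple(b ^ (rng.random() < f) for b in x)

def jointly_typical(x, y):
    d = sum(a != b for a, b in zip(x, y))                 # Hamming distance
    info_xy = 1.0 + (d * (-log2(f)) + (N - d) * (-log2(1 - f))) / N
    # For a uniform-input BSC, -1/N log2 P(x) = -1/N log2 P(y) = 1 exactly,
    # so only the joint condition is non-trivial.
    return abs(info_xy - HXY) < beta

def decode(y):
    hits = [s for s, x in enumerate(code, start=1) if jointly_typical(x, y)]
    return hits[0] if len(hits) == 1 else 0               # shat = 0 means failure

trials, errors = 200, 0
for _ in range(trials):
    s = rng.randrange(1, S + 1)
    errors += decode(channel(code[s - 1])) != s
print(f"estimated block error of this random code: {errors / trials:.2f}")
```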
Average Error Over All Codes
Let's consider the average error over random codes:
⟨pB⟩ = ∑_C P(ŝ ≠ s | C) P(C)
A bound on the average ⟨f⟩ of some function f of a random variable z ∈ Z with probabilities P(z) guarantees there is at least one z* ∈ Z such that f(z*) is smaller than the bound.¹
So ⟨pB⟩ < δ ⟹ pB(C*) < δ for some C*.
Analogy: Suppose the average height of the class is less than 160 cm. Then at least one of you must be shorter than 160 cm.
¹If ⟨f⟩ < δ but f(z) ≥ δ for all z, then ⟨f⟩ = ∑_z f(z)P(z) ≥ ∑_z δ P(z) = δ — a contradiction!
19 / 34
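Here is a small simulation of the averaging argument, under assumed toy parameters (a BSC, minimum-distance decoding in place of typical-set decoding, tiny N and K): estimate pB(C) for a handful of random codes and observe that the best of them is no worse than the ensemble average.

```python
import random

f, N, K, trials = 0.1, 15, 4, 400
rng = random.Random(1)

def random_code():
    return [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(2 ** K)]

def send(x):                              # BSC with flip probability f
    return tuple(b ^ (rng.random() < f) for b in x)

def decode(code, y):                      # min Hamming distance (ML for a BSC)
    return min(range(len(code)),
               key=lambda s: sum(a != b for a, b in zip(code[s], y)))

def block_error(code):
    errs = 0
    for _ in range(trials):
        s = rng.randrange(2 ** K)
        errs += decode(code, send(code[s])) != s
    return errs / trials

p_B = [block_error(random_code()) for _ in range(20)]
print(f"<p_B> over 20 random codes: {sum(p_B) / len(p_B):.3f}")
print(f"best single code          : {min(p_B):.3f}")   # always <= the average
```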
Proof Sketch of NCCT Part 1
Want to prove
Any rate R < C is achievable for Q (i.e., for any tolerance ε > 0 there is an (N,K) code with rate K/N ≥ R and maximal block error pBM < ε)
Let us thus bound ⟨pB⟩ for our random code. Choose some δ > 0.
1 Part one of the Joint Typicality Theorem says we can find an N(δ) such that the probability that (x, y) are not jointly typical is less than δ.
2 Thus, the average probability of error satisfies (by part three of the Joint Typicality Theorem)
⟨pB⟩ = ∑_{atypical (x,y)} P(ŝ ≠ s | ·) + ∑_{typical (x,y)} P(ŝ ≠ s | ·)
     ≤ δ + ∑_{s′=2}^{2^{NR′}} 2^{−N(I(X;Y)−3β)} ≤ δ + 2^{−N(I(X;Y)−R′−3β)}
3 Increasing N will make ⟨pB⟩ < 2δ provided R′ < I(X;Y) − 3β
4 Choosing P(x) to maximise I(X;Y) turns the required condition into R′ < C − 3β
20 / 34
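A quick numerical illustration of step 3 (the values of δ, I(X;Y), R′ and β below are arbitrary choices, not from the slides): once R′ < I(X;Y) − 3β, the second term of the bound decays exponentially in N, so ⟨pB⟩ < 2δ for N large enough.

```python
# Evaluate <p_B> <= delta + 2^{-N(I(X;Y) - R' - 3*beta)} for a few blocklengths.
delta, I_XY, Rprime, beta = 0.01, 0.5, 0.4, 0.02   # illustrative values only

for N in (100, 200, 500, 1000):
    bound = delta + 2 ** (-N * (I_XY - Rprime - 3 * beta))
    print(f"N = {N:4d}:  <p_B> <= {bound:.6f}")     # tends to delta < 2*delta
```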
Code Expurgation
The last main "trick" is to show that if there is an (N,K) code C with rate R′ and pB(C) < δ, then we can construct a new (N,K′) code C′ with rate R′ − 1/N and maximum probability of error pBM(C′) < 2δ.
We create C′ by expurgating (throwing out) half the codewords from C, specifically the half with the largest conditional probability of error.
[Figure 10.5: How expurgation works. (a) In a typical random code, a small fraction of the codewords are involved in collisions – pairs of codewords are sufficiently close to each other that the probability of error when either codeword is transmitted is not tiny. We obtain a new code from a random code by deleting all these confusable codewords. (b) The resulting code has slightly fewer codewords, so has a slightly lower rate, and its maximal probability of error is greatly reduced.]
Proof: Code C′ has 2^{NR′}/2 = 2^{NR′−1} messages, so its rate is K′/N = R′ − 1/N.
Suppose pBM(C′) = max_s P(ŝ ≠ s | s, C′) ≥ 2δ. Since C′ keeps the half of the codewords with the smallest conditional probability of error, every s that was thrown out must also have P(ŝ ≠ s | s, C) ≥ 2δ.
But then, with messages uniformly distributed,
pB(C) = ∑_s P(ŝ ≠ s | s, C) P(s) ≥ (1/2) · 2δ + (1/2) · 0 = δ,
contradicting pB(C) < δ. (A small numerical sketch of this step appears after the converse comments below.)
From MacKay §10.3–§10.4, as shown on the slide:
2. Since the average probability of error over all codes is < 2δ, there must exist a code with mean probability of block error pB(C) < 2δ.
3. To show that not only the average but also the maximal probability of error, pBM, can be made small, we modify this code by throwing away the worst half of the codewords – the ones most likely to produce errors. Those that remain must all have conditional probability of error less than 4δ. We use these remaining codewords to define a new code. This new code has 2^{NR′−1} codewords, i.e., we have reduced the rate from R′ to R′ − 1/N (a negligible reduction, if N is large), and achieved pBM < 4δ. This trick is called expurgation (figure 10.5). The resulting code may not be the best code of its rate and length, but it is still good enough to prove the noisy-channel coding theorem, which is what we are trying to do here.
In conclusion, we can 'construct' a code of rate R′ − 1/N, where R′ < C − 3β, with maximal probability of error < 4δ. We obtain the theorem as stated by setting R′ = (R + C)/2, δ = ε/4, β < (C − R′)/3, and N sufficiently large for the remaining conditions to hold. The theorem's first part is thus proved.
[Figure 10.6: Portion of the R, pb plane proved achievable in the first part of the theorem. We have proved that the maximal probability of block error pBM can be made arbitrarily small, so the same goes for the bit error probability pb, which must be smaller than pBM.]
We have proved, for any discrete memoryless channel, the achievability of a portion of the R, pb plane shown in figure 10.6. We have shown that we can turn any noisy channel into an essentially noiseless binary channel with rate up to C bits per cycle. We now extend the right-hand boundary of the region of achievability at non-zero error probabilities. [This is called rate-distortion theory.] We do this with a new trick. Since we know we can make the noisy channel into a perfect channel with a smaller rate, it is sufficient to consider communication with errors over a noiseless channel. How fast can we communicate over a noiseless channel, if we are allowed to make errors?
Consider a noiseless binary channel, and assume that we force communication at a rate greater than its capacity of 1 bit. For example, if we require the sender to attempt to communicate at R = 2 bits per cycle then he must effectively throw away half of the information. What is the best way to do this if the aim is to achieve the smallest possible probability of bit error? One simple strategy is to communicate a fraction 1/R of the source bits, and ignore the rest. The receiver guesses the missing fraction 1 − 1/R at random, and …
21 / 34
Wrapping It All Up
From the previous slides,
⟨pB⟩ < 2δ ⟹ there is some C′ such that pBM(C′) < 4δ, with rate R′ − 1/N
Setting R′ = (R + C)/2, δ = ε/4, β < (C − R′)/3 gives the result!
22 / 34
NCCT Part 1: Comments
The NCCT shows the existence of good codes; actually constructing practical codes is another matter
In principle, one could try the coding scheme outlined in the proof
However, it would require a lookup in an exponentially sized table (for the typical-set decoding)!
Over the past few decades, some codes (e.g. Turbo codes) have been shown to achieve rates close to the Shannon capacity
Beyond the scope of this course!
23 / 34
NCCT Converse: Comments
One can in fact make a stronger statement about
pB,avg = (1/2^K) ∑_{s_in} P(s_out ≠ s_in | s_in),
the probability of block error assuming a uniform distribution over inputs
We have:
pB,avg ≥ 1 − O(e^{−N(R−C)})
Thus, if R > C, the probability of block error shoots to 1 as N increases!
We have a “phase transition” around C between perfectly reliable and
perfectly unreliable communication!
24 / 34
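As promised on the expurgation slide, here is a small numerical sketch of that step, using synthetic per-codeword error probabilities (an assumption made purely for illustration): discarding the worst half of the codewords leaves a code whose maximal conditional error is at most twice the original average.

```python
import random

rng = random.Random(2)
S = 64
# Hypothetical per-codeword conditional error probabilities P(shat != s | s, C):
p_err = [min(1.0, rng.expovariate(1 / 0.02)) for _ in range(S)]

p_B = sum(p_err) / S                         # average error, uniform messages
kept = sorted(p_err)[: S // 2]               # expurgate the worst half
p_BM_expurgated = max(kept)

print(f"p_B(C)     = {p_B:.4f}")
print(f"p_BM(C')   = {p_BM_expurgated:.4f}")
print(f"2 * p_B(C) = {2 * p_B:.4f}")         # p_BM(C') <= 2 p_B(C), as argued above
```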
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
25 / 34
Theory and Practice
The difference between theory and practice is that, in theory,
there is no difference between theory and practice but, in
practice, there is.
— Jan L. A. van de Snepscheut
Theory vs. Practice
The NCCT tells us that good block codes exist for any noisy
channel (in fact, most random codes are good)
However, the theorem is non-constructive: it does not tell us how to
create practical codes for a given noisy channel
The construction of practical codes that achieve rates up to the
capacity for general channels is ongoing research
26 / 34
Types of Codes
When we talk about types of codes we will be referring to schemes for creating (N,K) codes for any blocklength N. MacKay makes the following distinctions:
Bad: Cannot achieve arbitrarily small error, or can only achieve it if the rate goes to zero (i.e., either pBM → a > 0 as N → ∞, or pBM → 0 ⟹ K/N → 0)
Good: Can achieve arbitrarily small error up to some maximum rate strictly less than the channel capacity (i.e., for any ε > 0 a good coding scheme can make a code with K/N = Rmax < C and pBM < ε)
Very Good: Can achieve arbitrarily small error at any rate up to the channel capacity (i.e., for any ε > 0 and any R < C, a very good coding scheme can make a code with K/N ≥ R and pBM < ε)
Practical: Can be encoded and decoded in time that is polynomial in the block length N.
27 / 34
Random Codes
During the discussion of the Noisy-Channel Coding Theorem we saw how
to construct very good random codes via typical set decoding
Properties:
Very Good: Rates up to C are achievable with arbitrarily small error
Construction is easy
Not Practical:
– The 2^K codewords have no structure and must be "memorised"
– Typical-set decoding is expensive
28 / 34
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
29 / 34
Linear Codes
(N,K) Block Code
An (N,K) block code is a list of S = 2^K codewords {x^(1), . . . , x^(S)}, each of length N. A message s ∈ {1, 2, . . . , 2^K} is encoded as x^(s).
Linear (N,K) Block Code
A linear (N,K) block code is an (N,K) block code where s is first represented as a K-bit binary vector s ∈ {0,1}^K and then encoded via multiplication by an N × K binary matrix G^T to form t = G^T s (modulo 2).
Here linear means all S = 2^K codewords can be obtained by "adding" (mod 2) different combinations of the K basis codewords t_i = G^T e_i, where e_i is the K-bit string with a single 1 in position i (counting from the right, starting at 0).
Example: Suppose (N,K) = (7,4). To send s = 3, first create s = 0011 and send t = G^T s = G^T(e_0 + e_1) = G^T e_0 + G^T e_1 = t_0 + t_1, where e_0 = 0001 and e_1 = 0010.
30 / 34
Types of Linear Code
Many commonly used codes are linear:
Repetition Codes: e.g., 0→ 000 ; 1→ 111
Convolutional Codes: Linear coding plus bit shifts
Concatenated Codes: Two or more levels of error correction
Hamming Codes: Parity checking
Low-Density Parity-Check Codes: Semi-random construction
The NCCT can be proved for linear codes (i.e., with "there exists a linear code"
replacing "there exists a code"), but the proof is still non-constructive.
Practical linear codes:
Use very large block sizes N
Based on semi-random code constructions
Apply probabilistic decoding techniques
Used in wireless and satellite communication
31 / 34
Linear Codes: Examples
(7,4) Hamming Code
G^T =
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 1 ]
[ 1 1 1 0 ]
[ 0 1 1 1 ]
[ 1 0 1 1 ]
For s = 0011,
G^T s (mod 2) = [0 0 1 1 1 0 0]^T
(6,3) Repetition Code
G^T =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
For s = 010,
G^T s (mod 2) = [0 1 0 0 1 0]^T
32 / 34
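The two encodings on this slide are easy to check by hand or with a couple of lines of code; the following sketch (using numpy purely as a convenience) reproduces both codewords.

```python
import numpy as np

G_T_hamming = np.array([[1, 0, 0, 0],
                        [0, 1, 0, 0],
                        [0, 0, 1, 0],
                        [0, 0, 0, 1],
                        [1, 1, 1, 0],
                        [0, 1, 1, 1],
                        [1, 0, 1, 1]])       # (7,4) Hamming code, as above

G_T_repetition = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1],
                           [1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])       # (6,3) repetition code, as above

s1 = np.array([0, 0, 1, 1])                  # message 0011
s2 = np.array([0, 1, 0])                     # message 010

print((G_T_hamming @ s1) % 2)                # [0 0 1 1 1 0 0]
print((G_T_repetition @ s2) % 2)             # [0 1 0 0 1 0]
```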
Decoding
We can construct codes with a relatively simple encoding, but how do we decode them? That is, given the input distribution and channel model Q, how do we find the posterior distribution over x given that we received y?
Simple? Just compute
P(x | y) = P(y | x) P(x) / ∑_{x′∈C} P(y | x′) P(x′)
But:
the number of codewords x ∈ C is 2^K so, naively, the sum is expensive
linear codes provide structure that the above method doesn't exploit
33 / 34
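Here is a minimal sketch of the naive posterior computation above, under assumed toy ingredients (a binary symmetric channel with flip probability f and a uniform prior over a small illustrative (6,2) linear code): it works, but the denominator sums over all 2^K codewords, which is exactly what makes the approach impractical for large K.

```python
import numpy as np

f = 0.1                                     # assumed BSC flip probability
code = [np.array(x) for x in ([0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0],
                              [1, 0, 1, 1, 0, 1], [1, 1, 1, 1, 1, 1])]  # toy (6,2) code
prior = np.full(len(code), 1 / len(code))   # uniform prior over codewords

def likelihood(y, x):                       # P(y | x) for a memoryless BSC
    d = int(np.sum(y != x))                 # number of flipped bits
    return (f ** d) * ((1 - f) ** (len(y) - d))

y = np.array([0, 1, 0, 0, 1, 1])            # received word
joint = np.array([likelihood(y, x) * p for x, p in zip(code, prior)])
posterior = joint / joint.sum()             # Bayes' rule: P(x | y)

for x, p in zip(code, posterior):
    print(x, f"P(x | y) = {p:.3f}")
```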
Summary and Reading
Main Points:
Joint Typicality and the Joint Typicality Theorem
The (Longer) Noisy Channel Coding Theorem
Proof Ideas
I Random Coding & Typical Set Decoding
I Average Error Over Random Codes
I Code Expurgation
Reading:
MacKay §9.7, §10.1-§10.5
34 / 34