COMP2610/6261 – Information Theory – Lecture 20: Joint-Typicality and the Noisy-Channel Coding Theorem
Robert C. Williamson
Research School of Computer Science
October 24th, 2018
Channel Capacity: Recap
The largest possible reduction in uncertainty achievable across a channel
is its capacity
Channel Capacity
The capacity C of a channel Q is the largest mutual information between its input and output over all choices of input ensemble. That is,
C = max_{p_X} I(X;Y)
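As a quick illustration (a sketch, not from the slides), the following code estimates C numerically by brute-force search over binary input distributions; the channel matrix, the flip probability f = 0.15, the grid size, and all names are illustrative choices.

import numpy as np

def mutual_information(pX, Q):
    """I(X;Y) in bits; Q[y, x] = P(y|x), pX[x] = P(x)."""
    pXY = Q * pX                          # joint P(x, y), indexed [y, x]
    pY = pXY.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(pXY > 0, pXY / np.outer(pY, pX), 1.0)
    return float(np.sum(pXY * np.log2(ratio)))

# Binary symmetric channel with flip probability f = 0.15 (an example value).
f = 0.15
Q = np.array([[1 - f, f],
              [f, 1 - f]])                # rows: y = 0, 1; columns: x = 0, 1

ps = np.linspace(0.0, 1.0, 1001)          # grid over input distributions pX = (1-p, p)
capacities = [mutual_information(np.array([1 - p, p]), Q) for p in ps]
C = max(capacities)
print(f"C ≈ {C:.4f} bits, achieved near p = {ps[int(np.argmax(capacities))]:.2f}")
# For the BSC this should agree with the closed form C = 1 - H2(f) ≈ 0.3902.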
Block Codes: Recap
(N, K) Block Code
Given a channel Q with inputs X and outputs Y, an integer N > 0, and K > 0, an (N, K) block code for Q is a list of S = 2^K codewords
C = {x^(1), x^(2), . . . , x^(2^K)}
where each x^(s) ∈ X^N consists of N symbols from X.
The rate of a block code is K/N = (log2 S)/N
Reliability: Recap
[Diagram: s_in → Encoder → x → channel Q → y → Decoder → s_out]
Probability of (Block) Error
Given a channel Q, the probability of (block) error for a code is
pB = P(s_out ≠ s_in) = ∑_{s_in} P(s_out ≠ s_in | s_in) P(s_in)
and its maximum probability of (block) error is
pBM = max_{s_in} P(s_out ≠ s_in | s_in)
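To make the definition concrete, here is a small Monte Carlo sketch (an illustration, not from the slides) estimating pB for the repetition code R3 over a BSC with f = 0.1, assuming the two source messages are equiprobable and majority-vote decoding.

import numpy as np

rng = np.random.default_rng(0)
f, trials = 0.1, 100_000

s_in = rng.integers(0, 2, size=trials)                # equiprobable source bits
x = np.repeat(s_in[:, None], 3, axis=1)               # encode: 0 -> 000, 1 -> 111
flips = (rng.random((trials, 3)) < f).astype(int)     # BSC flips each bit w.p. f
y = x ^ flips
s_out = (y.sum(axis=1) >= 2).astype(int)              # majority-vote decoder
pB = np.mean(s_out != s_in)
print(f"estimated pB ≈ {pB:.4f}   (exact: 3f^2(1-f) + f^3 = {3*f**2*(1-f) + f**3:.4f})")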
The Noisy-Channel Coding Theorem: Recap
Informal Statement
Recall that a rate R is achievable if there is a block code with this rate and
arbitrarily small error probability
We highlighted the following remarkable result:
Noisy-Channel Coding Theorem (Informal)
If Q is a channel with capacity C then the rate R is achievable if and only
if R ≤ C, that is, the rate is no greater than the channel capacity.
Ideally, we would like to know:
Can we go above C if we allow some fixed probability of error?
Is there a maximal rate for a fixed probability of error?
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
The Noisy-Channel Coding Theorem
Formal Statement
Recall: a rate R is achievable if for any tolerance ε > 0, an (N,K) code with rate K/N ≥ R exists with max. block error pBM < ε
The Noisy-Channel Coding Theorem (Formal)
1. Any rate R < C is achievable for Q
2. If a probability of bit error pb := pB/K is acceptable, (N,K) codes exist with rates
K/N ≤ R(pb) = C / (1 − H2(pb))
3. For any pb, we cannot achieve a rate greater than R(pb) with probability of bit error pb.
Note that as pb → 1/2, R(pb) → +∞, while as pb → 0 or 1, R(pb) → C; so we cannot achieve a rate greater than C with arbitrarily small probability of bit error
Implications of NCCT
Suppose we know a channel has capacity 0.6 bits
We cannot achieve a rate of 0.8 with arbitrarily small error
We can achieve a rate of 0.8 with probability of bit error 5%, since 0.6 / (1 − H2(0.05)) = 0.8408 > 0.8
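A quick numeric check of the figures on this slide (a sketch, base-2 logs throughout):

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

C, pb = 0.6, 0.05
print(f"H2(0.05) = {H2(pb):.4f},  R(0.05) = C/(1 - H2(0.05)) = {C/(1 - H2(pb)):.4f}")
# Expect roughly 0.2864 and 0.8408.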
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
Joint Typicality
Recall that a random variable z from Z^N is typical for an ensemble Z whenever its average symbol information is within β of the entropy H(Z):
| −(1/N) log2 P(z) − H(Z) | < β
Joint Typicality
A pair of sequences x ∈ A_X^N and y ∈ A_Y^N, each of length N, are jointly typical (to tolerance β) for distribution P(x, y) if
1. x is typical of P(x) [z = x above]
2. y is typical of P(y) [z = y above]
3. (x, y) is typical of P(x, y) [z = (x, y) above]
The jointly typical set of all such pairs is denoted J_Nβ.
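The definition translates almost directly into code. The sketch below is a generic checker (function and variable names are mine, not the slides'); the example on the next slide supplies concrete sequences and entropies.

import numpy as np

def avg_symbol_info(seq, prob):
    """-(1/N) log2 P(seq) for an i.i.d. sequence, given per-symbol probabilities prob[s]."""
    return -float(np.mean([np.log2(prob[s]) for s in seq]))

def is_jointly_typical(x, y, pX, pY, pXY, H_X, H_Y, H_XY, beta):
    """The three conditions of the definition, each checked to tolerance beta."""
    pair_info = -float(np.mean([np.log2(pXY[(a, b)]) for a, b in zip(x, y)]))
    return (abs(avg_symbol_info(x, pX) - H_X) < beta
            and abs(avg_symbol_info(y, pY) - H_Y) < beta
            and abs(pair_info - H_XY) < beta)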
Joint Typicality
Example
Example (pX = (0.9, 0.1) and BSC with f = 0.2): a jointly typical pair of length N = 100:
x 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
y 0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111
Here:
x has 10 1's (cf. P(X = 1) = 0.1)
y has 26 1's (cf. P(Y = 1) = (0.8)(0.1) + (0.2)(0.9) = 0.26)
x, y differ in 20 bits (cf. P(X ≠ Y) = 0.2); this is essential in addition to the above two facts
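The quoted counts can be checked directly from the pair shown above. The sketch below rebuilds the two 100-bit strings and recomputes them; x's typicality check is shown, and y and the pair are analogous.

from math import log2

# Rebuild the slide's jointly typical pair: pX = (0.9, 0.1), BSC f = 0.2, N = 100.
x = "1"*10 + "0"*90
y = "00" + "1"*8 + "0"*72 + "1"*18
N = 100

print("ones in x:", x.count("1"))                     # 10, cf. P(X=1) = 0.1
print("ones in y:", y.count("1"))                     # 26, cf. P(Y=1) = 0.26
print("bits where x and y differ:",
      sum(a != b for a, b in zip(x, y)))               # 20, cf. P(X != Y) = 0.2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

# Typicality of x: average symbol information versus H(X) = H2(0.1).
info_x = -(x.count("1")*log2(0.1) + x.count("0")*log2(0.9)) / N
print(f"-1/N log2 P(x) = {info_x:.3f}   vs   H(X) = {H2(0.1):.3f}")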
Joint Typicality
Counts
[Figure 10.2 (MacKay): the jointly typical set. The horizontal direction represents A_X^N, the set of all input strings of length N; the vertical direction represents A_Y^N, the set of all output strings of length N. The outer box contains all conceivable input–output pairs; each dot is a jointly typical pair (x, y), and there are about 2^{NH(X,Y)} such dots.]
There are approximately:
2^{NH(X)} typical x ∈ A_X^N
2^{NH(Y)} typical y ∈ A_Y^N
2^{NH(X,Y)} typical pairs (x, y) ∈ A_X^N × A_Y^N
2^{NH(Y|X)} typical y given x
Thus, by selecting independent typical vectors, we arrive at a jointly typical pair with probability approximately
2^{NH(X,Y)} / (2^{NH(X)} · 2^{NH(Y)}) = 2^{−NI(X;Y)}
Here we used I(X;Y) = H(X) + H(Y) − H(X,Y)
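The running example makes this counting argument numeric (a sketch; N = 100 as on the previous slide, all names illustrative):

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

HX = H2(0.1)                        # input entropy
HY = H2(0.8*0.1 + 0.2*0.9)          # P(Y=1) = 0.26
HXY = HX + H2(0.2)                  # H(X,Y) = H(X) + H(Y|X), with H(Y|X) = H2(f)
I = HX + HY - HXY
N = 100
print(f"H(X)={HX:.3f}  H(Y)={HY:.3f}  H(X,Y)={HXY:.3f}  I(X;Y)={I:.3f}")
print(f"2^(N*H(X,Y)) / (2^(N*H(X)) * 2^(N*H(Y))) = 2^(-N*I) = 2^({-N*I:.1f})")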
Joint Typicality Theorem
Let x, y be drawn from (XY)^N with P(x, y) = ∏_n P(x_n, y_n).
Joint Typicality Theorem
For all tolerances β > 0:
1. Almost every pair is eventually jointly typical: P((x, y) ∈ J_Nβ) → 1 as N → ∞
2. The number of jointly typical sequences is roughly 2^{NH(X,Y)}: |J_Nβ| ≤ 2^{N(H(X,Y)+β)}
3. For x′ and y′ drawn independently from the marginals of P(x, y), P((x′, y′) ∈ J_Nβ) ≤ 2^{−N(I(X;Y)−3β)}
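Plugging the running example into the theorem's bounds (a sketch; the block length and tolerance are my choices). Note that part 3 is only informative when β < I(X;Y)/3, which is why a small tolerance and a large N are used here.

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

HX, HY = H2(0.1), H2(0.26)          # pX = (0.9, 0.1) into a BSC with f = 0.2
HXY = H2(0.1) + H2(0.2)
I = HX + HY - HXY                   # ≈ 0.105 bits
N, beta = 1000, 0.02                # beta < I/3, so part 3 gives a non-trivial bound

print(f"|J_Nbeta| <= 2^(N(H(X,Y)+beta)) = 2^({N*(HXY + beta):.0f})")
print(f"P((x',y') in J_Nbeta) <= 2^(-N(I - 3*beta)) = 2^({-N*(I - 3*beta):.1f})")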
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
The Noisy-Channel Coding Theorem
Let Q be a channel with inputs A_X and outputs A_Y.
Let C = max_{pX} I(X;Y) be the capacity of Q and H2(p) = −p log2 p − (1 − p) log2(1 − p).
[Figure 10.1 (MacKay): the portion of the (R, pb) plane proved achievable (regions 1 and 2) and not achievable (region 3).]
The Noisy-Channel Coding Theorem
1. Any rate R < C is achievable for Q (i.e., for any tolerance ε > 0, an (N,K) code with rate K/N ≥ R exists with max. block error pBM < ε)
2. If a probability of bit error pb := pB/K is acceptable, there exist (N,K) codes with rates
K/N ≤ R(pb) = C / (1 − H2(pb))
3. For any pb, rates greater than R(pb) are not achievable.
Some Intuition for the NCCT
The proof of the NCCT is based on the following observations:
Each choice of input distribution pX induces an output distribution pY
There are 2^{NH(Y)} typical y (i.e., with information content per symbol ≈ H(Y))
For each x there are 2^{NH(Y|X)} typical y for that x
There are at most 2^{NH(Y)} / 2^{NH(Y|X)} = 2^{N(H(Y)−H(Y|X))} = 2^{NI(X;Y)} inputs x with disjoint sets of typical y. Coding with these x minimises error
The best rate K/N is achieved when the number of such x (i.e., 2^K) is maximised: 2^K ≤ max_{pX} 2^{NI(X;Y)} = 2^{N max_{pX} I(X;Y)} = 2^{NC}
[Figure (MacKay 9.7/9.8): extended channels obtained from a Z channel with transition probability 0.15 for N = 1, 2, 4, with typical output sets for typical inputs and a non-overlapping subset of them, as for the noisy typewriter.]
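A quick numeric version of this counting bound for a concrete channel (a sketch; the BSC with f = 0.1 and uniform inputs is my choice, not the slides'):

from math import log2

def H2(p):
    return -p*log2(p) - (1 - p)*log2(1 - p)

f, N = 0.1, 100
I = 1 - H2(f)          # H(Y) - H(Y|X) for a BSC with uniform input
print(f"at most about 2^({N*I:.0f}) non-confusable inputs, i.e. K <= about {N*I:.0f} bits")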
Proof Sketch of NCCT Part 1
We can:
define a family of random codes, which rely on joint typicality, and
which achieve the given rate
show that on average, such a code has a low probability of block error
deduce that at least one such code must have a low probability of
block error
“expurgate” the above code so that it has low maximal probability of
error
This will establish that the final code achieves low maximal probability of
error, while achieving the given rate!
Random Coding and Typical Set Decoding
Make random code C with rate R′:
Fix pX and choose S = 2^{NR′} codewords, x^(1), . . . , x^(S), each drawn with P(x) = ∏_n P(x_n)
Decode y via typical sets:
If there is exactly one ŝ such that (x^(ŝ), y) are jointly typical, then decode y as ŝ
Otherwise, fail (ŝ = 0)
Errors:
pB(C) = P(ŝ ≠ s | C)
⟨pB⟩ = ∑_C P(ŝ ≠ s | C) P(C)
pBM(C) = max_s P(ŝ ≠ s | s, C)
(Aim: ∃ C s.t. pBM(C) is small)
[Figure 10.4: (a) A random code with codewords x^(1), . . . , x^(4). (b) Example decodings by the typical-set decoder. A sequence that is not jointly typical with any of the codewords, such as y_a, is decoded as ŝ = 0. A sequence that is jointly typical with codeword x^(3) alone, y_b, is decoded as ŝ = 3. Similarly, y_c is decoded as ŝ = 4. A sequence that is jointly typical with more than one codeword, such as y_d, is decoded as ŝ = 0.]
1. We fix P(x) and generate the S = 2^{NR′} codewords of an (N,K) code C at random according to
P(x) = ∏_{n=1}^{N} P(x_n).   (10.11)
A random code is shown schematically in figure 10.4a.
2. The code is known to both sender and receiver.
3. A message s is chosen from {1, 2, . . . , 2^{NR′}}, and x^(s) is transmitted. The received signal is y, with
P(y | x^(s)) = ∏_{n=1}^{N} P(y_n | x^(s)_n).   (10.12)
4. The signal is decoded by typical-set decoding.
Typical-set decoding. Decode y as ŝ if (x^(ŝ), y) are jointly typical and there is no other s′ such that (x^(s′), y) are jointly typical; otherwise declare a failure (ŝ = 0).
This is not the optimal decoding algorithm, but it will be good enough, and easier to analyze. The typical-set decoder is illustrated in figure 10.4b.
5. A decoding error occurs if ŝ ≠ s.
There are three probabilities of error that we can distinguish. First, there is the probability of block error for a particular code C, that is,
pB(C) ≜ P(ŝ ≠ s | C).   (10.13)
This is a difficult quantity to evaluate for any given code.
Second, there is the average over all codes of this block error probability,
⟨pB⟩ ≜ ∑_C P(ŝ ≠ s | C) P(C).   (10.14)
18 / 34
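The random-coding and typical-set-decoding scheme on this slide can be simulated directly. The following is a minimal sketch under assumed toy parameters (a binary symmetric channel with flip probability f, uniform input ensemble, small blocklength N and typicality slack beta); it is illustrative only, and at such a small N the block error is still far from zero — the theorem is a statement about N → ∞.

```python
from math import log2
import random

f, N, beta = 0.1, 25, 0.2            # illustrative channel and typicality slack
Rprime = 0.25                        # code rate R'
S = 2 ** round(N * Rprime)           # number of codewords (here 2^6 = 64)
rng = random.Random(0)

def H2(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

HXY = 1.0 + H2(f)                    # joint entropy for a uniform-input BSC

# Random code: each codeword drawn i.i.d. uniform, i.e. P(x) = prod_n P(x_n)
code = [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(S)]

def channel(x):                      # pass x through the BSC
    return tuple(b ^ (rng.random() < f) for b in x)

def jointly_typical(x, y):
    d = sum(a != b for a, b in zip(x, y))                 # Hamming distance
    info_xy = 1.0 + (d * (-log2(f)) + (N - d) * (-log2(1 - f))) / N
    # For a uniform-input BSC, -1/N log2 P(x) = -1/N log2 P(y) = 1 exactly,
    # so only the joint condition is non-trivial.
    return abs(info_xy - HXY) < beta

def decode(y):
    hits = [s for s, x in enumerate(code, start=1) if jointly_typical(x, y)]
    return hits[0] if len(hits) == 1 else 0               # shat = 0 means failure

trials, errors = 200, 0
for _ in range(trials):
    s = rng.randrange(1, S + 1)
    errors += decode(channel(code[s - 1])) != s
print(f"estimated block error of this random code: {errors / trials:.2f}")
```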
Average Error Over All Codes
Let's consider the average error over random codes:
⟨pB⟩ = ∑_C P(ŝ ≠ s | C) P(C)
A bound on the average ⟨f⟩ of some function f of a random variable z ∈ Z with probabilities P(z) guarantees there is at least one z* ∈ Z such that f(z*) is smaller than the bound.¹
So ⟨pB⟩ < δ ⟹ pB(C*) < δ for some C*.
Analogy: Suppose the average height of the class is less than 160 cm. Then at least one of you must be shorter than 160 cm.
¹If ⟨f⟩ < δ but f(z) ≥ δ for all z, then ⟨f⟩ = ∑_z f(z)P(z) ≥ ∑_z δ P(z) = δ — a contradiction!
19 / 34
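Here is a small simulation of the averaging argument, under assumed toy parameters (a BSC, minimum-distance decoding in place of typical-set decoding, tiny N and K): estimate pB(C) for a handful of random codes and observe that the best of them is no worse than the ensemble average.

```python
import random

f, N, K, trials = 0.1, 15, 4, 400
rng = random.Random(1)

def random_code():
    return [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(2 ** K)]

def send(x):                              # BSC with flip probability f
    return tuple(b ^ (rng.random() < f) for b in x)

def decode(code, y):                      # min Hamming distance (ML for a BSC)
    return min(range(len(code)),
               key=lambda s: sum(a != b for a, b in zip(code[s], y)))

def block_error(code):
    errs = 0
    for _ in range(trials):
        s = rng.randrange(2 ** K)
        errs += decode(code, send(code[s])) != s
    return errs / trials

p_B = [block_error(random_code()) for _ in range(20)]
print(f"<p_B> over 20 random codes: {sum(p_B) / len(p_B):.3f}")
print(f"best single code          : {min(p_B):.3f}")   # always <= the average
```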
Proof Sketch of NCCT Part 1
Want to prove
Any rate R < C is achievable for Q (i.e., for any tolerance ε > 0 there is an (N,K) code with rate K/N ≥ R and maximal block error pBM < ε)
Let us thus bound ⟨pB⟩ for our random code. Choose some δ > 0.
1 Part one of the Joint Typicality Theorem says we can find an N(δ) such that the probability that (x, y) are not jointly typical is less than δ.
2 Thus, the average probability of error satisfies (by part three of the Joint Typicality Theorem)
⟨pB⟩ = ∑_{atypical (x,y)} P(ŝ ≠ s | ·) + ∑_{typical (x,y)} P(ŝ ≠ s | ·)
     ≤ δ + ∑_{s′=2}^{2^{NR′}} 2^{−N(I(X;Y)−3β)} ≤ δ + 2^{−N(I(X;Y)−R′−3β)}
3 Increasing N will make ⟨pB⟩ < 2δ provided R′ < I(X;Y) − 3β
4 Choosing P(x) to maximise I(X;Y) turns the required condition into R′ < C − 3β
20 / 34
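A quick numerical illustration of step 3 (the values of δ, I(X;Y), R′ and β below are arbitrary choices, not from the slides): once R′ < I(X;Y) − 3β, the second term of the bound decays exponentially in N, so ⟨pB⟩ < 2δ for N large enough.

```python
# Evaluate <p_B> <= delta + 2^{-N(I(X;Y) - R' - 3*beta)} for a few blocklengths.
delta, I_XY, Rprime, beta = 0.01, 0.5, 0.4, 0.02   # illustrative values only

for N in (100, 200, 500, 1000):
    bound = delta + 2 ** (-N * (I_XY - Rprime - 3 * beta))
    print(f"N = {N:4d}:  <p_B> <= {bound:.6f}")     # tends to delta < 2*delta
```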
Code Expurgation
The last main "trick" is to show that if there is an (N,K) code C with rate R′ and pB(C) < δ, then we can construct a new (N,K′) code C′ with rate R′ − 1/N and maximum probability of error pBM(C′) < 2δ.
We create C′ by expurgating (throwing out) half the codewords from C, specifically the half with the largest conditional probability of error.
[Figure 10.5: How expurgation works. (a) In a typical random code, a small fraction of the codewords are involved in collisions – pairs of codewords are sufficiently close to each other that the probability of error when either codeword is transmitted is not tiny. We obtain a new code from a random code by deleting all these confusable codewords. (b) The resulting code has slightly fewer codewords, so has a slightly lower rate, and its maximal probability of error is greatly reduced.]
Proof: Code C′ has 2^{NR′}/2 = 2^{NR′−1} messages, so its rate is K′/N = R′ − 1/N.
Suppose pBM(C′) = max_s P(ŝ ≠ s | s, C′) ≥ 2δ. Since C′ keeps the half of the codewords with the smallest conditional probability of error, every s that was thrown out must also have P(ŝ ≠ s | s, C) ≥ 2δ.
But then, with messages uniformly distributed,
pB(C) = ∑_s P(ŝ ≠ s | s, C) P(s) ≥ (1/2) · 2δ + (1/2) · 0 = δ,
contradicting pB(C) < δ. (A small numerical sketch of this step appears after the converse comments below.)
From MacKay §10.3–§10.4, as shown on the slide:
2. Since the average probability of error over all codes is < 2δ, there must exist a code with mean probability of block error pB(C) < 2δ.
3. To show that not only the average but also the maximal probability of error, pBM, can be made small, we modify this code by throwing away the worst half of the codewords – the ones most likely to produce errors. Those that remain must all have conditional probability of error less than 4δ. We use these remaining codewords to define a new code. This new code has 2^{NR′−1} codewords, i.e., we have reduced the rate from R′ to R′ − 1/N (a negligible reduction, if N is large), and achieved pBM < 4δ. This trick is called expurgation (figure 10.5). The resulting code may not be the best code of its rate and length, but it is still good enough to prove the noisy-channel coding theorem, which is what we are trying to do here.
In conclusion, we can 'construct' a code of rate R′ − 1/N, where R′ < C − 3β, with maximal probability of error < 4δ. We obtain the theorem as stated by setting R′ = (R + C)/2, δ = ε/4, β < (C − R′)/3, and N sufficiently large for the remaining conditions to hold. The theorem's first part is thus proved.
[Figure 10.6: Portion of the R, pb plane proved achievable in the first part of the theorem. We have proved that the maximal probability of block error pBM can be made arbitrarily small, so the same goes for the bit error probability pb, which must be smaller than pBM.]
We have proved, for any discrete memoryless channel, the achievability of a portion of the R, pb plane shown in figure 10.6. We have shown that we can turn any noisy channel into an essentially noiseless binary channel with rate up to C bits per cycle. We now extend the right-hand boundary of the region of achievability at non-zero error probabilities. [This is called rate-distortion theory.] We do this with a new trick. Since we know we can make the noisy channel into a perfect channel with a smaller rate, it is sufficient to consider communication with errors over a noiseless channel. How fast can we communicate over a noiseless channel, if we are allowed to make errors?
Consider a noiseless binary channel, and assume that we force communication at a rate greater than its capacity of 1 bit. For example, if we require the sender to attempt to communicate at R = 2 bits per cycle then he must effectively throw away half of the information. What is the best way to do this if the aim is to achieve the smallest possible probability of bit error? One simple strategy is to communicate a fraction 1/R of the source bits, and ignore the rest. The receiver guesses the missing fraction 1 − 1/R at random, and …
21 / 34
Wrapping It All Up
From the previous slides,
⟨pB⟩ < 2δ ⟹ there is some C′ such that pBM(C′) < 4δ, with rate R′ − 1/N
Setting R′ = (R + C)/2, δ = ε/4, β < (C − R′)/3 gives the result!
22 / 34
NCCT Part 1: Comments
The NCCT shows the existence of good codes; actually constructing practical codes is another matter
In principle, one could try the coding scheme outlined in the proof
However, it would require a lookup in an exponentially sized table (for the typical-set decoding)!
Over the past few decades, some codes (e.g. Turbo codes) have been shown to achieve rates close to the Shannon capacity
Beyond the scope of this course!
23 / 34
NCCT Converse: Comments
One can in fact make a stronger statement about
pB,avg = (1/2^K) ∑_{s_in} P(s_out ≠ s_in | s_in),
the probability of block error assuming a uniform distribution over inputs
We have:
pB,avg ≥ 1 − O(e^{−N(R−C)})
Thus, if R > C, the probability of block error shoots to 1 as N increases!
We have a “phase transition” around C between perfectly reliable and
perfectly unreliable communication!
24 / 34
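As promised on the expurgation slide, here is a small numerical sketch of that step, using synthetic per-codeword error probabilities (an assumption made purely for illustration): discarding the worst half of the codewords leaves a code whose maximal conditional error is at most twice the original average.

```python
import random

rng = random.Random(2)
S = 64
# Hypothetical per-codeword conditional error probabilities P(shat != s | s, C):
p_err = [min(1.0, rng.expovariate(1 / 0.02)) for _ in range(S)]

p_B = sum(p_err) / S                         # average error, uniform messages
kept = sorted(p_err)[: S // 2]               # expurgate the worst half
p_BM_expurgated = max(kept)

print(f"p_B(C)     = {p_B:.4f}")
print(f"p_BM(C')   = {p_BM_expurgated:.4f}")
print(f"2 * p_B(C) = {2 * p_B:.4f}")         # p_BM(C') <= 2 p_B(C), as argued above
```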
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
25 / 34
Theory and Practice
The difference between theory and practice is that, in theory,
there is no difference between theory and practice but, in
practice, there is.
— Jan L. A. van de Snepscheut
Theory vs. Practice
The NCCT tells us that good block codes exist for any noisy
channel (in fact, most random codes are good)
However, the theorem is non-constructive: it does not tell us how to
create practical codes for a given noisy channel
The construction of practical codes that achieve rates up to the
capacity for general channels is ongoing research
26 / 34
Types of Codes
When we talk about types of codes we will be referring to schemes for creating (N,K) codes for any blocklength N. MacKay makes the following distinctions:
Bad: Cannot achieve arbitrarily small error, or can only achieve it if the rate goes to zero (i.e., either pBM → a > 0 as N → ∞, or pBM → 0 ⟹ K/N → 0)
Good: Can achieve arbitrarily small error up to some maximum rate strictly less than the channel capacity (i.e., for any ε > 0 a good coding scheme can make a code with K/N = Rmax < C and pBM < ε)
Very Good: Can achieve arbitrarily small error at any rate up to the channel capacity (i.e., for any ε > 0 and any R < C, a very good coding scheme can make a code with K/N ≥ R and pBM < ε)
Practical: Can be encoded and decoded in time that is polynomial in the block length N.
27 / 34
Random Codes
During the discussion of the Noisy-Channel Coding Theorem we saw how
to construct very good random codes via typical set decoding
Properties:
Very Good: Rates up to C are achievable with arbitrarily small error
Construction is easy
Not Practical:
– The 2^K codewords have no structure and must be "memorised"
– Typical-set decoding is expensive
28 / 34
1 Noisy-Channel Coding Theorem
2 Joint Typicality
3 Proof Sketch of the NCCT
4 Good Codes vs. Practical Codes
5 Linear Codes
29 / 34
Linear Codes
(N,K) Block Code
An (N,K) block code is a list of S = 2^K codewords {x^(1), . . . , x^(S)}, each of length N. A message s ∈ {1, 2, . . . , 2^K} is encoded as x^(s).
Linear (N,K) Block Code
A linear (N,K) block code is an (N,K) block code where s is first represented as a K-bit binary vector s ∈ {0,1}^K and then encoded via multiplication by an N × K binary matrix G^T to form t = G^T s (modulo 2).
Here linear means all S = 2^K codewords can be obtained by "adding" (mod 2) different combinations of the K basis codewords t_i = G^T e_i, where e_i is the K-bit string with a single 1 in position i (counting from the right, starting at 0).
Example: Suppose (N,K) = (7,4). To send s = 3, first create s = 0011 and send t = G^T s = G^T(e_0 + e_1) = G^T e_0 + G^T e_1 = t_0 + t_1, where e_0 = 0001 and e_1 = 0010.
30 / 34
Types of Linear Code
Many commonly used codes are linear:
Repetition Codes: e.g., 0→ 000 ; 1→ 111
Convolutional Codes: Linear coding plus bit shifts
Concatenated Codes: Two or more levels of error correction
Hamming Codes: Parity checking
Low-Density Parity-Check Codes: Semi-random construction
The NCCT can be proved for linear codes (i.e., with "there exists a linear code"
replacing "there exists a code"), but the proof is still non-constructive.
Practical linear codes:
Use very large block sizes N
Based on semi-random code constructions
Apply probabilistic decoding techniques
Used in wireless and satellite communication
31 / 34
Linear Codes: Examples
(7,4) Hamming Code
G^T =
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 1 ]
[ 1 1 1 0 ]
[ 0 1 1 1 ]
[ 1 0 1 1 ]
For s = 0011,
G^T s (mod 2) = [0 0 1 1 1 0 0]^T
(6,3) Repetition Code
G^T =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
For s = 010,
G^T s (mod 2) = [0 1 0 0 1 0]^T
32 / 34
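The two encodings on this slide are easy to check by hand or with a couple of lines of code; the following sketch (using numpy purely as a convenience) reproduces both codewords.

```python
import numpy as np

G_T_hamming = np.array([[1, 0, 0, 0],
                        [0, 1, 0, 0],
                        [0, 0, 1, 0],
                        [0, 0, 0, 1],
                        [1, 1, 1, 0],
                        [0, 1, 1, 1],
                        [1, 0, 1, 1]])       # (7,4) Hamming code, as above

G_T_repetition = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1],
                           [1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])       # (6,3) repetition code, as above

s1 = np.array([0, 0, 1, 1])                  # message 0011
s2 = np.array([0, 1, 0])                     # message 010

print((G_T_hamming @ s1) % 2)                # [0 0 1 1 1 0 0]
print((G_T_repetition @ s2) % 2)             # [0 1 0 0 1 0]
```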
Decoding
We can construct codes with a relatively simple encoding, but how do we decode them? That is, given the input distribution and channel model Q, how do we find the posterior distribution over x given that we received y?
Simple? Just compute
P(x | y) = P(y | x) P(x) / ∑_{x′∈C} P(y | x′) P(x′)
But:
the number of codewords x ∈ C is 2^K so, naively, the sum is expensive
linear codes provide structure that the above method doesn't exploit
33 / 34
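Here is a minimal sketch of the naive posterior computation above, under assumed toy ingredients (a binary symmetric channel with flip probability f and a uniform prior over a small illustrative (6,2) linear code): it works, but the denominator sums over all 2^K codewords, which is exactly what makes the approach impractical for large K.

```python
import numpy as np

f = 0.1                                     # assumed BSC flip probability
code = [np.array(x) for x in ([0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0],
                              [1, 0, 1, 1, 0, 1], [1, 1, 1, 1, 1, 1])]  # toy (6,2) code
prior = np.full(len(code), 1 / len(code))   # uniform prior over codewords

def likelihood(y, x):                       # P(y | x) for a memoryless BSC
    d = int(np.sum(y != x))                 # number of flipped bits
    return (f ** d) * ((1 - f) ** (len(y) - d))

y = np.array([0, 1, 0, 0, 1, 1])            # received word
joint = np.array([likelihood(y, x) * p for x, p in zip(code, prior)])
posterior = joint / joint.sum()             # Bayes' rule: P(x | y)

for x, p in zip(code, posterior):
    print(x, f"P(x | y) = {p:.3f}")
```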
Summary and Reading
Main Points:
Joint Typicality and the Joint Typicality Theorem
The (Longer) Noisy Channel Coding Theorem
Proof Ideas
I Random Coding & Typical Set Decoding
I Average Error Over Random Codes
I Code Expurgation
Reading:
MacKay §9.7, §10.1-§10.5
34 / 34