COMP2610/6261 – Information Theory
Lecture 16: Arithmetic Coding

Robert C. Williamson

Research School of Computer Science


25 September, 2018

1 / 37

1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

2 / 37

Interval Codes (Recap)

Shannon-Fano-Elias coding method:

Order the alphabet A.
Represent the distribution p by its cumulative distribution F.
Construct the code by finding, inside each symbol interval [F(ai−1), F(ai)),
an interval of width pi/2.

[Figure: the unit interval [0, 1) divided into symbol intervals for a1, ..., a4,
each containing a width-pi/2 codeword interval ending at F(ai) − pi/2;
example codewords: 1, 110, 100, 01000, 0001]
3 / 37

Intervals and Prefix Codes (Recap)

The set of numbers in [0, 1) that start with a given sequence of bits
b = b1 . . . bn forms the interval

    [0.b1 . . . bn, 0.b1 . . . bn + 1/2^n) = [0.b1 . . . bn, 0.b1 . . . bn + 0.0 . . . 01)

This interval contains all binary strings for which b1b2 . . . bn is a prefix

Prefix property (interval form): once you pick a codeword b1b2 . . . bn, you
cannot pick any other codeword inside the codeword interval

    [0.b1b2 . . . bn, 0.b1b2 . . . bn + 1/2^n)
4 / 37
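The interval for a bit string can be computed directly; a small illustrative snippet (not part of the lecture, names are mine):

```python
def bit_interval(bits):
    """Interval of numbers in [0, 1) whose binary expansion
    starts with the given bit string b1...bn."""
    lo = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))
    return lo, lo + 2 ** -len(bits)

print(bit_interval("101"))  # (0.625, 0.75): everything with prefix 101
```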

1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

5 / 37

Interval Coding Blocks

What if we apply SFE coding to blocks of an ensemble X?
Example: Let A = {aa, ab, ba, bb} with p = (0.2, 0.6, 0.1, 0.1).

x     p     F̄      F̄ (binary)      ℓ   Code
aa    0.2   0.1    0.00011...₂     4   0001
ab    0.6   0.5    0.1₂            2   10
ba    0.1   0.85   0.1101100...₂   5   11011
bb    0.1   0.95   0.111100...₂    5   11110

    aa: [0, 0.2) → 0001    ab: [0.2, 0.8) → 10    ba: [0.8, 0.9) → 11011    bb: [0.9, 1) → 11110

We could extend this to longer sequences. This works, but:

We need P(x) for all x: that is |A|^N values for blocks of length N
Huffman coding has similar complexity but gives shorter codes

6 / 37
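The table above can be checked mechanically. A small illustrative script (not from the slides) that computes SFE codewords, i.e. the first ⌈log₂ 1/pi⌉ + 1 bits of the midpoint F(ai−1) + pi/2:

```python
from math import ceil, log2

def sfe_code(probs):
    """Shannon-Fano-Elias codewords for an ordered list of
    probabilities: first ceil(log2(1/p)) + 1 bits of the
    midpoint F(a_{i-1}) + p_i/2 of each symbol interval."""
    codes = []
    cum = 0.0
    for p in probs:
        mid = cum + p / 2                 # midpoint of symbol interval
        length = ceil(log2(1 / p)) + 1    # codeword length
        bits, frac = "", mid
        for _ in range(length):           # binary expansion of midpoint
            frac *= 2
            bits += str(int(frac))
            frac -= int(frac)
        codes.append(bits)
        cum += p
    return codes

print(sfe_code([0.2, 0.6, 0.1, 0.1]))  # ['0001', '10', '11011', '11110']
```

This reproduces the codewords in the table for aa, ab, ba, bb.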


Arithmetic Coding: A Bird’s Eye View

The basic idea of arithmetic coding follows SFE coding:

            SFE coding                        Arithmetic coding
Input       Single outcome xi                 Sequence of outcomes x1 x2 . . . xN
Key step    Find the symbol interval for      Find an interval for
            xi, i.e. [F(xi−1), F(xi))         x1 x2 . . . xN (how?)
Output      Binary string for the chosen      Binary string for the chosen
            interval: first ℓ(xi) bits of     interval: first ℓ(x1 x2 . . . xN)
            the interval's midpoint           bits of the interval's midpoint

8 / 37

Arithmetic Coding: Summary

Arithmetic coding has some important properties:
We do not compute a symbol code for X and then concatenate
  - we work directly with blocks of size N
We do not explicitly code all length-N sequences at once
  - highly efficient
We do not assume that the xi's are independent
  - not restricted to extended ensembles
  - adapts to the data distribution

9 / 37

1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

10 / 37

Computing an Interval for Sequences

Say N = 2 and we want to code x1 x2

Naively, we'd compute p(x) for all possible x of length 2, and then find the
interval for x1 x2 directly

Key ideas:
we can write p(x1 x2) = p(x1) p(x2|x1)
  - decompose the joint into conditional probabilities
p(·|x1) is just another probability distribution
  - so we can compute intervals as per SFE
we can find the interval for x2 within the interval for x1
  - normal SFE computes the interval within [0, 1) by default
11 / 37


Computing an Interval for Sequences

Example: Suppose A = {a, b, c} and p(a) = 0.25, p(b) = 0.5, p(c) = 0.25

Like with SFE coding, we begin by slicing [0, 1) into three subintervals:

    a: [0, 0.25)    b: [0.25, 0.75)    c: [0.75, 1)

So e.g. we treat [0.25, 0.75) as the interval for b

12 / 37


Computing an Interval for Sequences

Suppose the first symbol is b, and p(a|b) = 0.25, p(b|b) = 0.5, p(c|b) = 0.25

To code ba, bb, bc, we now slice up [0.25, 0.75), the interval for b itself:

    ba: [0.25, 0.375)    bb: [0.375, 0.625)    bc: [0.625, 0.75)

For ba we choose the interval of length p(a|b) = 0.25 times the length of the
enclosing interval (0.75 − 0.25 = 0.5), i.e. (0.25)(0.5) = 0.125

13 / 37
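The slicing step above is just a linear rescaling of the enclosing interval; a tiny helper to make that concrete (illustrative, not from the slides):

```python
def refine(u, v, lo, hi):
    """Sub-interval of [u, v) for a symbol whose cumulative bounds
    under the conditional distribution are [lo, hi)."""
    w = v - u
    return u + w * lo, u + w * hi

# b occupies [0.25, 0.75); refining by p(a|b) = 0.25 gives ba's interval
print(refine(0.25, 0.75, 0.0, 0.25))   # (0.25, 0.375)
print(refine(0.25, 0.75, 0.25, 0.75))  # bb: (0.375, 0.625)
```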


Arithmetic Coding: End of Stream Symbol

It is convenient to have an explicit "end of stream" symbol, □

We add this symbol to our ensemble, with some suitable probability

e.g. p(□) = probability of seeing the empty string, p(□|a) = probability of
seeing just the string a, etc.
Implicitly, we think of ab as really being ab□

The end of stream is by definition reached when we choose the sub-interval
for this special symbol

14 / 37

Arithmetic Coding: End of Stream Example

Example: Suppose A = {a, b, c, □} and
p(a) = 0.25, p(b) = 0.5, p(c) = 0.15, p(□) = 0.1

Like with SFE coding, we begin by slicing [0, 1) into four subintervals:

    a: [0, 0.25)    b: [0.25, 0.75)    c: [0.75, 0.9)    □: [0.9, 1)

15 / 37


Arithmetic Coding: End of Stream Example

Now suppose that p(·|b) stays the same as p(·)

If the first symbol is b, we carve the interval for b into four pieces:

    ba: [0.25, 0.375)    bb: [0.375, 0.625)    bc: [0.625, 0.7)    b□: [0.7, 0.75)

Exact same idea as before, just with the special symbol □

16 / 37


Arithmetic Coding for Arbitrary Sequences

These ideas generalise to arbitrary length sequences

We don’t even need to know the sequence length beforehand!

As we see more symbols, we slice the appropriate sub-interval of [0, 1)
based on the probabilities

Terminate whenever we see □

17 / 37

1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

18 / 37

Arithmetic Coding: Codeword Generation

Once we’ve seen the entire sequence, we end up with interval [u, v)

How to output a codeword?

As per SFE coding, we can use the first ℓ(x1 x2 . . . xN) bits of (u + v)/2

Here, ℓ(x1 x2 . . . xN) = ⌈log₂ 1/p(x1 x2 . . . xN)⌉ + 1
As before, this guarantees that all strings starting with the codeword are
contained in the codeword interval

Generally, we can output some bits on the fly, rather than wait till we
process the entire sequence

19 / 37


Arithmetic Coding: Codeword Generation Example

In the previous example with input b, we stop in the interval for b□, i.e. [0.7, 0.75)

    ba: [0.25, 0.375)    bb: [0.375, 0.625)    bc: [0.625, 0.7)    b□: [0.7, 0.75)

The midpoint is 0.725 = 0.101110011...₂, and p(b□) = (1/2) · (0.1) = 0.05

Output the first ⌈log₂ 1/0.05⌉ + 1 = 6 bits, i.e. 101110

20 / 37


1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

21 / 37

Arithmetic Coding: Formal Encoder

Formally, we compute the interval [u, v) for a generic sequence as follows:

Arithmetic coding of stream x1 x2 . . .
    u ← 0.0
    v ← 1.0
    p ← v − u
    for n = 1, 2, . . .
        compute Ln(ai | x1, . . . , xn−1) = Σ_{i′=1}^{i−1} p(xn = ai′ | x1, . . . , xn−1)
        compute Un(ai | x1, . . . , xn−1) = Σ_{i′=1}^{i} p(xn = ai′ | x1, . . . , xn−1)
        v ← u + p · Un(xn | x1, . . . , xn−1)
        u ← u + p · Ln(xn | x1, . . . , xn−1)
        p ← v − u
        if xn = □, terminate
    output the first ℓ(x1 x2 . . . xN) = ⌈log₂ 1/p⌉ + 1 bits of (u + v)/2

Here, Ln and Un are the cumulative lower and upper bounds of each symbol
interval, as per SFE coding
We rescale these by the current interval length

22 / 37
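The pseudocode above can be transcribed directly into Python. An illustrative sketch (function and variable names are mine), using "#" to stand in for the end-of-stream symbol □:

```python
from math import ceil, log2

def arithmetic_encode(stream, cond_prob, alphabet, eos):
    """Arithmetic encoder following the pseudocode above.
    cond_prob(prefix) returns the conditional distribution over
    `alphabet` given the symbols seen so far."""
    u, v = 0.0, 1.0
    for n, x in enumerate(stream):
        p = v - u
        dist = cond_prob(stream[:n])
        # L_n and U_n: cumulative bounds of x's symbol interval
        lo = sum(dist[a] for a in alphabet[:alphabet.index(x)])
        hi = lo + dist[x]
        u, v = u + p * lo, u + p * hi
        if x == eos:
            break
    # output the first ceil(log2(1/p)) + 1 bits of the midpoint
    length = ceil(log2(1 / (v - u))) + 1
    mid, bits = (u + v) / 2, ""
    for _ in range(length):
        mid *= 2
        bits += str(int(mid))
        mid -= int(mid)
    return bits

# Static distribution from the earlier example: p(b#) = 0.5 * 0.1 = 0.05
dist = {"a": 0.25, "b": 0.5, "c": 0.15, "#": 0.1}
print(arithmetic_encode("b#", lambda prefix: dist, ["a", "b", "c", "#"], "#"))  # 101110
```

Running it on the input b reproduces the codeword 101110 from the earlier slide.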

1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

23 / 37

Decoding

How do we decode a sequence of bits?

Rough sketch:

Carve up [0, 1) based on the initial distribution

Keep reading bits until the current code interval lies inside a symbol interval

Output that symbol

Carve up that symbol's interval based on the conditional probabilities

Stop once the code interval is contained in the interval for □

24 / 37

Decoding: Example

Suppose p(a) = 0.5, p(b) = 0.125, p(c) = 0.25, p(□) = 0.125 for every
outcome in the sequence

Decode 0110111:

Bits      Interval (binary)        Interval (decimal)      Comment
0         [0.0, 0.1)₂              [0, 0.5)                First symbol a
01        [0.01, 0.10)₂            [0.25, 0.5)
011       [0.011, 0.100)₂          [0.375, 0.5)
0110      [0.0110, 0.0111)₂        [0.375, 0.4375)         Next symbol c
01101     [0.01101, 0.01110)₂      [0.40625, 0.4375)
011011    [0.011011, 0.011100)₂    [0.421875, 0.4375)      Next symbol □

(The symbol c is pinned down at 0110, once the code interval fits inside c's
sub-interval [0.3125, 0.4375).)
The last bit of 0110111 is actually redundant (inherited from the +1 bit in
the midpoint representation)
25 / 37
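The rough sketch above can be turned into a small decoder. This is an illustrative sketch in Python (not from the lecture; names are mine), with "#" standing in for □:

```python
def arithmetic_decode(bits, dist, alphabet, eos):
    """Decode by tracking the interval implied by the bits read so
    far, emitting a symbol whenever that interval lies entirely
    inside one symbol sub-interval."""
    u, v = 0.0, 1.0      # current source interval
    cu, cv = 0.0, 1.0    # code interval of the bits read so far
    out = []
    for bit in bits:
        mid = (cu + cv) / 2
        cu, cv = (cu, mid) if bit == "0" else (mid, cv)
        emitted = True
        while emitted:            # emit as many symbols as are pinned down
            emitted = False
            lo = 0.0
            for a in alphabet:
                hi = lo + dist[a]
                su, sv = u + (v - u) * lo, u + (v - u) * hi
                if su <= cu and cv <= sv:     # containment test
                    out.append(a)
                    if a == eos:
                        return "".join(out)
                    u, v, emitted = su, sv, True
                    break
                lo = hi
    return "".join(out)

# Distribution from the example; "#" stands in for end-of-stream
dist = {"a": 0.5, "b": 0.125, "c": 0.25, "#": 0.125}
print(arithmetic_decode("0110111", dist, ["a", "b", "c", "#"], "#"))  # ac#
```

The decoder emits a, then c, then the end-of-stream symbol, matching the table: a after the first bit, c after four bits, and termination after six.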


1 From SFE to Arithmetic Coding

2 Arithmetic Coding: Encoder
Intervals for Sequences
Codeword Generation
Putting it all together

3 Arithmetic Coding: Decoder

4 Adapting Distributions On-The-Fly

26 / 37

Adaptive Probabilities

So far we have assumed the sequence of probabilities is given in advance

In Lecture 5, you saw the Bernoulli distribution for two outcomes

Beta distribution Beta(θ|mh, mt) as a prior for Bern(x|θ)

The posterior after observing nh heads and nt tails is just
Beta(θ|mh + nh, mt + nt)

The expected value of θ under the posterior is

    p(x = h | nh, nt, mh, mt) = (mh + nh) / (mh + nh + mt + nt)
27 / 37


Dirichlet Model

A Dirichlet distribution is a generalisation of the Beta distribution to more
than two outcomes. Its parameter, a vector m = (m1, . . . , mK), can be
viewed as "virtual counts" for each symbol a1, . . . , aK:

    P(x = ai | x1 . . . xn) = (♯(ai) + mi) / Σ_{k=1}^{K} (♯(ak) + mk)

where ♯(ai) counts the occurrences of ai in x1 . . . xn.

We can implement an adaptive model just by counting symbol occurrences.

Flexible
    e.g., choose m to match the frequencies of English letters
    Large Σk mk = stable; small Σk mk = responsive
28 / 37


Adaptive Probabilities: Laplace’s Rule

Example: Start with mh = mt = 1 and observe the sequence hht.

    p(·|ε) = (1/2, 1/2),  p(·|h) = (2/3, 1/3),  p(·|hh) = (3/4, 1/4),  p(·|hht) = (3/5, 2/5)

viz. Laplace's Rule, where ε means the empty string

Why? Because e.g.

    p(h|h) = (1 + 1) / (1 + 1 + 1 + 0) = 2/3

    p(t|h) = (1 + 0) / (1 + 1 + 1 + 0) = 1/3

We'll assume this learning applies only to non-□ symbols

    assume □ occurs with a fixed probability each time
29 / 37
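The Dirichlet predictive update is a one-liner; a short illustrative sketch (function names are mine, not from the slides) that reproduces the hht numbers:

```python
def predictive(counts, m):
    """Dirichlet predictive distribution: observed counts plus
    virtual counts m, normalised (Laplace's Rule when all m_k = 1)."""
    total = sum(counts.values()) + sum(m.values())
    return {a: (counts.get(a, 0) + m[a]) / total for a in m}

m = {"h": 1, "t": 1}                    # virtual counts mh = mt = 1
print(predictive({}, m))                # p(.|empty): {'h': 0.5, 't': 0.5}
print(predictive({"h": 1}, m))          # p(.|h): h -> 2/3, t -> 1/3
print(predictive({"h": 2, "t": 1}, m))  # p(.|hht): h -> 0.6, t -> 0.4
```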

Adaptive Probabilities: Example

Possible outcomes a, b, □

Observations: (none yet)

Probabilities: p|ε = (0.425, 0.425, 0.15)

Encoder output: (none yet)

    a: [0, 0.425)    b: [0.425, 0.85)    □: [0.85, 1)

We start off with virtual counts ma = mb = 1

30 / 37

Adaptive Probabilities: Example

Possible outcomes a, b, □

Observations: b

Probabilities: p|b ≈ (0.28, 0.57, 0.15)

Encoder output: (none yet)

    ba: [0.425, 0.5454)    bb: [0.5454, 0.7863)    b□: [0.7863, 0.85)

Seeing b makes us update p(a|b) = (0.85) · (1/3) ≈ 0.28, and
p(b|b) = (0.85) · (2/3) ≈ 0.57. We keep p(□|b) = p(□).

31 / 37

Adaptive Probabilities: Example

Possible outcomes a, b, □

Observations: bb

Probabilities: p|bb ≈ (0.21, 0.64, 0.15)

Encoder output: 1

    bba: [0.5454, 0.5966)    bbb: [0.5966, 0.7501)    bb□: [0.7501, 0.7863)

Seeing bb makes us update p(a|bb) = (0.85) · (1/4) ≈ 0.21, and
p(b|bb) = (0.85) · (3/4) ≈ 0.64
Now the first bit is unambiguously 1

32 / 37

Adaptive Probabilities: Example

Possible outcomes a, b, □

Observations: bbb

Probabilities: p|bbb ≈ (0.17, 0.68, 0.15)

Encoder output: 1

    bbba: [0.5966, 0.6227)
33 / 37

Adaptive Probabilities: Example

Possible outcomes a, b, □

Observations: bbba

Probabilities: p|bbba ≈ (0.28, 0.57, 0.15)

Encoder output: 1001

    bbbaa: [0.5966, 0.6040)    bbbab: [0.6040, 0.6188)    bbba□: [0.6188, 0.6227)

On seeing a, we can fill in three further bits unambiguously
34 / 37

Adaptive Probabilities: Example

Possible outcomes a, b, □

Observations: bbba□

Probabilities: N/A (the stream has terminated)

Encoder output: 100111101

    bbba□ = [0.6188, 0.6227) = [0.100111100...₂, 0.100111110...₂)

To terminate, we output the first bits of the midpoint of 0.100111100...₂ and
0.100111110...₂, i.e. 100111101

35 / 37
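Putting the pieces together, the adaptive (Laplace) predictor combined with the interval encoder reproduces the codeword from this running example. A self-contained sketch (names are mine; "#" stands in for □):

```python
from math import ceil, log2

ALPHABET = ["a", "b", "#"]   # "#" stands in for the end-of-stream symbol
P_EOS = 0.15                 # fixed end-of-stream probability

def adaptive_dist(prefix):
    """Laplace-rule distribution over a, b (virtual counts 1 each),
    scaled so end-of-stream keeps fixed probability 0.15."""
    na, nb = prefix.count("a"), prefix.count("b")
    pa = (1 - P_EOS) * (1 + na) / (2 + na + nb)
    pb = (1 - P_EOS) * (1 + nb) / (2 + na + nb)
    return {"a": pa, "b": pb, "#": P_EOS}

def encode(stream):
    u, v = 0.0, 1.0
    for n, x in enumerate(stream):
        p = v - u
        dist = adaptive_dist(stream[:n])
        lo = sum(dist[a] for a in ALPHABET[:ALPHABET.index(x)])
        u, v = u + p * lo, u + p * (lo + dist[x])
        if x == "#":
            break
    length = ceil(log2(1 / (v - u))) + 1   # ceil(log2(1/p)) + 1 bits
    mid, bits = (u + v) / 2, ""
    for _ in range(length):
        mid *= 2
        bits += str(int(mid))
        mid -= int(mid)
    return bits

print(encode("bbba#"))  # 100111101
```

The output matches the slides: encoding bbba□ gives the 9-bit codeword 100111101.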

Arithmetic Coding: Example (MacKay, Figure 6.4)

[Figure (MacKay, Figure 6.4): the binary codeword intervals (1-bit through
5-bit strings) laid out alongside the nested source intervals for a, b and
the end-of-stream symbol; zooming in on bbba shows the sub-intervals bbbaa,
bbbab, bbba□ against the 8-bit codeword intervals 10010111 through 10100000,
with the interval for bbba□ selecting the codeword 100111101]
36 / 37

Summary and Reading

Main points
Arithmetic coding:
  - uses interval coding and conditional probability
  - separates coding and prediction
  - no need to compute every codeword
Predictive distributions:
  - update the distribution after each symbol
  - Beta and Dirichlet priors = virtual counts

Reading

Section 6.2 of MacKay

37 / 37
