COMP90038 Algorithms and Complexity
Lecture 21: Data Compression
(Slides from Harald Søndergaard)
Semester 2, 2021
Algorithms and Complexity (Sem 2, 2021) © University of Melbourne 1 / 17
Announcements
Assignment 2 is out on LMS.
Olya’s consultation session (Zoom link via LMS):
Tuesday October 12, 2:30 – 3:30 pm
Last lecture: Greedy algorithms
Prim’s algorithm for finding minimum spanning trees
Dijkstra’s algorithm for single-source shortest paths
Data Compression
From an information-theoretic point of view, most computer files contain much redundancy.
Compression is used to store files in less space.
For text files, savings up to 50% are common.
For binary files, savings up to 90% are common.
Savings in space mean savings in time for file transmission.
Savings in computer space and time mean savings in energy usage.
Run-Length Encoding
For a text that has long runs of repeated characters, we could compress by counting the runs:
AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD can then be encoded as
4A3BAA5B8CDABCB3A4B3CD
For English text this is not very useful. For binary files it can be very good.
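The scheme above can be written as a short Python sketch (the name `rle_encode` is my own, not from the slides), assuming runs are shorter than 10 so a single count digit suffices; runs of fewer than three characters are copied verbatim, as in the example:

```python
def rle_encode(text):
    # Replace each run of 3 or more identical characters by
    # <count><char>; shorter runs are copied unchanged.
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        if run >= 3:
            out.append(f"{run}{text[i]}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

print(rle_encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD"))
# 4A3BAA5B8CDABCB3A4B3CD
```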
Run-Length Encoding
000000000000000000000000000011111111111111000000000 000000000000000000000000001111111111111111110000000 000000000000000000000001111111111111111111111110000 000000000000000000000011111111111111111111111111000 000000000000000000001111111111111111111111111111110 000000000000000000011111110000000000000000001111111 000000000000000000011111000000000000000000000011111 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 000000000000000000001111000000000000000000000001110 000000000000000000000011100000000000000000000111000 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011000000000000000000000000000000000000000000000011
Variable-Length Encoding
Instead of using, say, 8 bits for characters (as ASCII does), assign characters codes in such a way that common characters have shorter codes.
For example, E could have code 0.
But then no other character code can start with 0.
In general, no character’s code should be a prefix of some other character’s code (unless we somehow put separators between characters, which would take up space).
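To make the prefix condition concrete, here is a small hypothetical helper (not from the slides) that checks whether a set of codes is prefix-free. It uses the fact that in lexicographically sorted order, a code that is a prefix of another code is immediately followed by one of its extensions:

```python
def is_prefix_free(codes):
    # A code set can be decoded without separators iff no code
    # is a prefix of another; checking adjacent sorted pairs suffices.
    codes = sorted(codes)
    return all(not codes[i + 1].startswith(codes[i])
               for i in range(len(codes) - 1))

print(is_prefix_free(["0", "10", "11"]))   # True
print(is_prefix_free(["0", "01", "11"]))   # False: "0" is a prefix of "01"
```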
Variable-Length Encoding
Suppose we count symbols and find these numbers of occurrences:
Symbol  Weight
B        4
D        5
G       10
F       12
C       14
E       27
A       28

Here are some sensible codes that we may use for symbols:

Symbol  Code
A       11
B       0000
C       011
D       0001
E       10
F       010
G       001
Tries for Variable-Length Encoding
A trie is a binary tree for search applications.
To search for a key we look at individual bits of the key and descend to the left whenever a bit is 0, to the right whenever it is 1.
Using a trie to determine codes means that no code will be the prefix
of another!
Reading the bits from the root down to each leaf gives the codes:
10 (E), 11 (A), 001 (G), 010 (F), 011 (C), 0000 (B), 0001 (D)
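Such a trie can be sketched in Python with nested lists, where index 0 holds the left (bit 0) child and index 1 the right (bit 1) child; the name `build_trie` and the representation are illustrative, not from the slides:

```python
def build_trie(codes):
    # Build a binary trie from a prefix-free code table.
    # Internal nodes are two-element lists [left, right];
    # a leaf holds the symbol itself.
    root = [None, None]
    for sym, code in codes.items():
        node = root
        for bit in code[:-1]:
            i = int(bit)
            if node[i] is None:
                node[i] = [None, None]
            node = node[i]
        node[int(code[-1])] = sym
    return root

codes = {"A": "11", "B": "0000", "C": "011", "D": "0001",
         "E": "10", "F": "010", "G": "001"}
trie = build_trie(codes)
print(trie[1][0])   # E sits at path 1, 0
```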
Variable-Length Encoding
With the codes given by the trie, we can represent FACE
in just 10 bits:
010 11 011 10
FACE
If we were to assign 3 bits per character, FACE would take 12 bits.
Encoding a Message
We shall shortly see how to design the codes for symbols, taking symbol frequencies into account.
Once we have a table of codes, encoding is straightforward.
For example, to encode ‘BAGGED’, simply concatenate the codes for B, A, G, G, E and D:
000011001001100001
(Codes: A 11, B 0000, C 011, D 0001, E 10, F 010, G 001.)
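Encoding is just table lookup and concatenation; a minimal Python sketch (the `encode` helper is an illustrative name):

```python
CODES = {"A": "11", "B": "0000", "C": "011", "D": "0001",
         "E": "10", "F": "010", "G": "001"}

def encode(text, codes=CODES):
    # Concatenate each symbol's code word; no separators are
    # needed because the codes are prefix-free.
    return "".join(codes[c] for c in text)

print(encode("BAGGED"))   # 000011001001100001
```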
Decoding a Message
To decode 000011001001100001, use the trie: repeatedly start from the root, follow the bits down, and emit the symbol found at each leaf.
Following the code words through the trie recovers the message: BAGGED.
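The trie walk can be sketched in Python; here a dictionary from code word to symbol stands in for the explicit trie, which is equivalent precisely because the codes are prefix-free (`decode` is an illustrative name, not from the slides):

```python
def decode(bits, codes):
    # Invert the code table, then scan the bit string, accumulating
    # bits until they form a complete code word; prefix-freeness
    # guarantees the first match is the correct one.
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

codes = {"A": "11", "B": "0000", "C": "011", "D": "0001",
         "E": "10", "F": "010", "G": "001"}
print(decode("000011001001100001", codes))   # BAGGED
```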
Huffman Encoding: Choosing the Codes
Sometimes (for example for common English text) we may know the frequencies of letters fairly well.
If we don’t know about frequencies then we can still count all characters in the given text as a first step.
But how do we assign codes to the characters once we know their frequencies?
By repeatedly selecting the two smallest weights and fusing them. This is Huffman’s algorithm (1952), another example of a greedy method.
The resulting tree is a Huffman tree.
Start from a forest of single-node trees, one per symbol, with weights
4 (B), 5 (D), 10 (G), 12 (F), 14 (C), 27 (E), 28 (A),
and repeatedly fuse the two trees of smallest weight:
4 (B) and 5 (D) fuse into a tree of weight 9;
9 and 10 (G) fuse into 19;
12 (F) and 14 (C) fuse into 26;
19 and 26 fuse into 45;
27 (E) and 28 (A) fuse into 55;
finally 45 and 55 fuse into the complete tree.
We end up with the trie from before!
One can show that this gives the best encoding.
Book-Keeping
What is a suitable data structure to help keep track of which trees to fuse next? A priority queue: a min-heap of weights.
We use the priority queue construction algorithm to turn this:
(28,A), (4,B), (14,C), (5,D), (27,E), (12,F), (10,G)
into an initial min-heap, shown here level by level:
(4,B)
(5,D) (10,G)
(28,A) (27,E) (12,F) (14,C)
Then:
while |PQ| > 1 do
    T1 ← EjectMin(PQ)
    T2 ← EjectMin(PQ)
    Inject(Fuse(T1, T2), PQ)
Now, while the priority queue has size 2 or more, take out the two highest-priority trees, fuse them, and put the fused tree back in. After that, a single EjectMin(PQ) yields the Huffman tree.
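The whole construction can be sketched with Python’s `heapq` module standing in for the priority queue. This is a sketch, not the lecture’s exact pseudocode: ties are broken by an insertion counter, and the lighter of the two ejected trees is placed on the left (bit 0):

```python
import heapq

def huffman_codes(weights):
    # Min-heap of (weight, tiebreak, tree); a tree is either a
    # symbol (leaf) or a pair (left, right) of subtrees.
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # smallest weight: left child
        w2, _, t2 = heapq.heappop(heap)   # next smallest: right child
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        # Collect the bit path to every leaf of the finished tree.
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

weights = {"A": 28, "B": 4, "C": 14, "D": 5, "E": 27, "F": 12, "G": 10}
print(huffman_codes(weights))
```

On the slide’s weights this reproduces the code table from before (A 11, B 0000, C 011, D 0001, E 10, F 010, G 001).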
Compressed Transmission
If the compressed file is being sent from one party to another, the parties must agree about the codes used.
Alternatively, the trie can be sent along with the message.
For long files this extra cost is negligible.
More recent compression methods, such as Lempel-Ziv compression, assign codes not to individual symbols but to sequences of symbols.
Is Huffman encoding always good?
Next Lecture
In the next lecture we will briefly discuss complexity theory (a large topic), NP-completeness, and approximation algorithms.