COMP20007 Design of Algorithms
Data Compression
Daniel Beck
Lecture 16
Semester 1, 2020
Introduction
• So far, we have talked about speed and space performance from an algorithmic point of view.
• We assumed that records fit in memory (although we did mention secondary memory in Mergesort and B-trees).
• What to do when records are too large? (videos, for instance)
Fixed-length encoding
For text files, suppose each character has a fixed-size binary code.

BAGGED
01000010 01000001 01000111 01000111 01000101 01000100

This is exactly what ASCII does.
Key insight: this coding has redundant information.
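As an illustration, here is a minimal Python sketch (not from the original slides; ascii_encode is a hypothetical helper) showing that fixed-length encoding is a direct per-character lookup:

# A minimal sketch of fixed-length encoding: every character becomes
# the same number of bits (here, its 8-bit ASCII code).
def ascii_encode(text):
    return " ".join(format(ord(c), "08b") for c in text)

print(ascii_encode("BAGGED"))
# 01000010 01000001 01000111 01000111 01000101 01000100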
Run-length encoding
010000100100000101000111010001110100010101000100
0140120150101303101303101301010130120
Character-level: BAGGED→BA2GED
AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD 4A3BAA5B8CDABCB3A4B3CD
4
Run-length encoding
• While not very useful for text data, it can work for some kinds of binary data.
5
Run-length encoding
• While not very useful for text data, it can work for some kinds of binary data.
• For text, the best algorithms move away from using fixed-length codes (ASCII).
5
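A Python sketch of the character-level scheme (not from the original slides; rle_encode is a hypothetical helper, and it assumes the convention of the longer example above, where a count prefix is only used for runs of three or more, since abbreviating shorter runs saves nothing):

from itertools import groupby

# Character-level run-length encoding: runs of three or more become
# "<count><symbol>"; shorter runs are written out literally.
def rle_encode(s):
    out = []
    for sym, group in groupby(s):
        n = len(list(group))
        out.append(f"{n}{sym}" if n >= 3 else sym * n)
    return "".join(out)

print(rle_encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD"))
# 4A3BAA5B8CDABCB3A4B3CD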
Variable-length Encoding
• Key idea: some symbols appear more frequently than others.
• Instead of a fixed number of bits per symbol, use a variable number:
  • More frequent symbols use fewer bits.
  • Less frequent symbols use more bits.
• For this scheme to work, no symbol's code can be a prefix of another symbol's code.
Variable-Length Encoding
Suppose we count symbols and find these numbers of occurrences:

Symbol  Weight
B       4
D       5
G       10
F       12
C       14
E       27
A       28

Here are some sensible codes that we may use for the symbols:

Symbol  Code
A       11
B       0000
C       011
D       0001
E       10
F       010
G       001
Encoding a string
• Codes can be stored in a dictionary.
• Once we have the codes, encoding is straightforward.
• For example, to encode ‘BAGGED’, simply concatenate the codes for B, A, G, G, E and D:
000011001001100001
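A minimal Python sketch of dictionary-based encoding, using the code table above (encode and CODES are hypothetical names, not from the slides):

# Encode a string by dictionary lookup and concatenation.
CODES = {"A": "11", "B": "0000", "C": "011", "D": "0001",
         "E": "10", "F": "010", "G": "001"}

def encode(text):
    return "".join(CODES[c] for c in text)

print(encode("BAGGED"))  # 000011001001100001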
Decoding a string
• To decode, we can use another dictionary where keys are codes and values are symbols:

Code   Symbol
11     A
0000   B
011    C
0001   D
10     E
010    F
001    G

• Starting from the first digit, look it up in the dictionary. If it is not present, append the next digit and repeat until the code is valid.
000011001001100001
This requires many failed lookups per symbol: is there a better way?
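A Python sketch of this naive decoder (decode and DECODE are hypothetical names): it grows a candidate code one digit at a time until it matches an entry in the reverse dictionary, which is exactly where the failed lookups come from.

DECODE = {"11": "A", "0000": "B", "011": "C", "0001": "D",
          "10": "E", "010": "F", "001": "G"}

def decode(bits):
    out, candidate = [], ""
    for bit in bits:
        candidate += bit
        if candidate in DECODE:   # valid code found (prefix property!)
            out.append(DECODE[candidate])
            candidate = ""        # restart for the next symbol
    return "".join(out)

print(decode("000011001001100001"))  # BAGGED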
Tries
• Another implementation of a dictionary.
• Works when keys can be decomposed into parts (here, bits).

[Trie diagram: leaves hold the symbols, reached by following their codes bit by bit: 0000 (B), 0001 (D), 001 (G), 010 (F), 011 (C), 10 (E), 11 (A)]

This specific trie stores values only in the leaves → keeps the prefix property.
Tries
To decode 000011001001100001, use the trie: repeatedly start from the root, follow one edge per bit, and print each symbol found at a leaf.

How to choose the codes?
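A Python sketch of trie-based decoding (build_trie and trie_decode are hypothetical helpers; CODES is the table from the encoding sketch). Each internal node is a dict keyed by a bit and each leaf holds a symbol, so every input bit is a single step down the trie, with no failed lookups:

def build_trie(codes):
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol       # store the symbol at the leaf
    return root

def trie_decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):     # reached a leaf: emit, restart at root
            out.append(node)
            node = root
    return "".join(out)

trie = build_trie(CODES)
print(trie_decode("000011001001100001", trie))  # BAGGED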
Huffman Encoding
• Goal: obtain the optimal encoding given symbol frequencies.
• Treat each symbol as a leaf and build a binary tree bottom-up.
• At each step, the two nodes with the smallest frequencies are fused; the new node's frequency is their sum.
• The resulting tree is a Huffman tree.
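A Python sketch of the algorithm using a min-heap (huffman is a hypothetical helper; leaves are symbols and internal nodes are (left, right) pairs; the counter breaks weight ties so subtrees are never compared directly):

import heapq
from itertools import count

def huffman(weights):
    tiebreak = count()
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two smallest-weight nodes...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))  # ...fused
    return heap[0][2]

tree = huffman({"B": 4, "D": 5, "G": 10, "F": 12, "C": 14, "E": 27, "A": 28})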
Huffman Trees
Starting from the leaves B (4), D (5), G (10), F (12), C (14), E (27), A (28), the fusions proceed as follows:
• Fuse B (4) and D (5) into a node of weight 9.
• Fuse the 9 node and G (10) into a node of weight 19.
• Fuse F (12) and C (14) into a node of weight 26.
• Fuse the 19 and 26 nodes into a node of weight 45.
• Fuse E (27) and A (28) into a node of weight 55.
• Fuse the 45 and 55 nodes into the root.
We end up with the trie from before!
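Continuing the sketch above, the codes can be read off the tree returned by huffman (codes_from_tree is a hypothetical helper; which exact codes come out depends on tie-breaking, only the code lengths are forced, but with these weights it happens to reproduce the table from before):

def codes_from_tree(tree, prefix=""):
    if isinstance(tree, str):         # leaf: prefix is this symbol's code
        return {tree: prefix}
    left, right = tree                # left edge appends '0', right appends '1'
    codes = codes_from_tree(left, prefix + "0")
    codes.update(codes_from_tree(right, prefix + "1"))
    return codes

print(codes_from_tree(tree))
# {'B': '0000', 'D': '0001', 'G': '001', 'F': '010', 'C': '011', 'E': '10', 'A': '11'}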
The Greedy Method
Huffman is an example of the greedy method.
• The goal is optimisation: many solutions are “acceptable” but we want to find the best one.
• The problem can be divided into local subproblems.
• The greedy method builds a global solution through local and irrevocable decisions.
• For Huffman, the greedy method finds the optimal solution. Dijkstra and Prim are other examples.
• But this is not always the case.
Summary
• Most data we store in our computers has redundancy.
• Compression exploits redundancy to reduce space.
• Huffman is based on variable-length encoding.
• Tries are used to store and decode the codes.
Data Compression – In Practice
• Huffman encoding provides the basis for many advanced compression techniques.
• Lempel-Ziv compression assigns codes to sequences of symbols: used in GIF, PNG and ZIP (see the sketch below).
• For sequential data (audio/video), an alternative is linear prediction: predict the next frame given the previous ones. Used in FLAC.
• Lossy compression: JPEG, MP3, MPEG and others. These also employ Huffman encoding (among other techniques).
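To give a flavour of the Lempel-Ziv family, here is a Python sketch of LZW, the variant used in GIF (lzw_encode is a hypothetical helper and byte input is assumed): the dictionary maps growing sequences of symbols to integer codes and is extended while encoding, so repeated sequences collapse to single codes.

def lzw_encode(text):
    table = {chr(i): i for i in range(256)}   # codes 0-255: single bytes
    out, seq = [], ""
    for c in text:
        if seq + c in table:
            seq += c                      # extend the longest known sequence
        else:
            out.append(table[seq])        # emit code for the known sequence
            table[seq + c] = len(table)   # learn the new, longer sequence
            seq = c
    if seq:
        out.append(table[seq])
    return out

print(lzw_encode("ABABABA"))  # [65, 66, 256, 258]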
Next lecture: how to trade memory for speed and get a Θ(n) worst-case sorting algorithm…