COMP90038 Algorithms and Complexity
Lecture 21: Huffman Encoding for Data Compression (with thanks to Harald Søndergaard & Michael Kirley)
Andres Munoz-Acosta
munoz.m@unimelb.edu.au
Peter Hall Building G.83
• The SES is open now. Please take some time to review this subject. All feedback is greatly appreciated.
• The final exam has been scheduled for 1:15PM on Thursday 8th of November at Wilson Hall
• It is a closed book exam. No calculators are permitted.
• Reading time will be 15 minutes. Exam duration will be 3 hours.
• Answers must be provided in the exam paper in the space allocated.
• The reverse side of the pages can be used for rough work.
• All questions should be attempted. Some are easier than others.
• Any unreadable parts will be considered wrong. Be neat in your answers.
• A sample exam paper will be provided this week.
• We have instructed the tutors NOT to provide hints on the sample exam
• Assignment 2 is due next Sunday at 11:59PM.
• We will provide sample answers during the lecture and through the LMS.
• Next week we will use both lectures for a quick review of the content.
• Only examinable topics will be discussed in the review.
Recap
• We discussed greedy algorithms:
• A problem-solving strategy that makes the locally best choice among all feasible ones. Such a choice is irrevocable.
• Usually, locally best choices do not yield globally best results.
• For some problems, however, a greedy algorithm is both correct and fast.
• A greedy algorithm can also provide good approximations.
• We applied this idea to two graph problems:
• Prim’s algorithm for finding minimum spanning trees
• Dijkstra’s algorithm for single-source shortest paths
What is a Minimum Spanning Tree?
• A minimum spanning tree of a connected weighted graph ⟨V,E⟩ is a tree ⟨V,E’⟩, where E’ is a subset of E, whose total edge weight is as low as possible
• We use Prim’s algorithm to find a minimum spanning tree.
• It constructs a sequence of subtrees T, each time adding the closest node not currently in the tree.
[Figure: example weighted graph on six vertices a–f]
Prim’s Algorithm
• We examined the complete algorithm, which uses a priority queue:
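The priority-queue version can be sketched as follows. This is a minimal sketch, not the lecture's exact pseudocode: it uses Python's heapq with lazy deletion of stale entries instead of a DecreaseKey operation, and the small example graph at the bottom is an assumption for illustration.

```python
import heapq

def prim(graph, start):
    """Lazy-deletion Prim: 'graph' maps each node to a dict of
    {neighbour: edge weight}; returns the list of MST edges."""
    in_tree = {start}
    mst = []
    # Priority queue of candidate edges (weight, from, to).
    pq = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(pq)
    while pq and len(in_tree) < len(graph):
        w, u, v = heapq.heappop(pq)
        if v in in_tree:          # stale entry: v was added via a cheaper edge
            continue
        in_tree.add(v)
        mst.append((u, v, w))
        for x, wx in graph[v].items():
            if x not in in_tree:
                heapq.heappush(pq, (wx, v, x))
    return mst

# Hypothetical example graph (undirected, given as a symmetric adjacency map).
graph = {
    'a': {'b': 3, 'e': 6, 'f': 5},
    'b': {'a': 3, 'c': 1, 'f': 4},
    'c': {'b': 1, 'd': 6},
    'd': {'c': 6, 'f': 5},
    'e': {'a': 6, 'f': 2},
    'f': {'a': 5, 'b': 4, 'd': 5, 'e': 2},
}
mst = prim(graph, 'a')
```

With lazy deletion the queue may hold several entries per node, so the complexity is O(|E| log |E|), which matches the O(|E| log |V|) bound of the DecreaseKey version up to a constant factor.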
Another example
• Let’s work with the following graph: [Figure: weighted graph on vertices a–f with edges including a–b:3, a–e:6, a–f:5, b–c:1, b–f:4, c–d:6, f–d:5, f–e:2]
(each cell is cost/prev; ‘-’ marks a node already in the tree)

Tree T       | a     | b     | c     | d     | e     | f
(init)       | ∞/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil
(start = a)  | 0/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil
a            |   -   | 3/a   | ∞/nil | ∞/nil | 6/a   | 5/a
a,b          |   -   |   -   | 1/b   | ∞/nil | 6/a   | 4/b
a,b,c        |   -   |   -   |   -   | 6/c   | 6/a   | 4/b
a,b,c,f      |   -   |   -   |   -   | 5/f   | 2/f   |   -
a,b,c,f,e    |   -   |   -   |   -   | 5/f   |   -   |   -
a,b,c,f,e,d  |   -   |   -   |   -   |   -   |   -   |   -

• What would happen if we start on b?
• The sequence will be different, but the edges may be the same.
• How many different trees can we have?
• If there are ties, the tie-breaking has influence.
Dijkstra’s Algorithm
• Dijkstra’s algorithm finds all shortest paths from a fixed start node. Its complexity is the same as that of Prim’s algorithm.
Another example
• Let’s work with this graph again: [Figure: the same weighted graph on vertices a–f]
(each cell is cost/prev; ‘-’ marks a finalized node)

Tree T       | a     | b     | c     | d     | e     | f
(init)       | ∞/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil
(start = a)  | 0/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil | ∞/nil
a            |   -   | 3/a   | ∞/nil | ∞/nil | 6/a   | 5/a
a,b          |   -   |   -   | 4/b   | ∞/nil | 6/a   | 5/a
a,b,c        |   -   |   -   |   -   | 10/c  | 6/a   | 5/a
a,b,c,f      |   -   |   -   |   -   | 10/c  | 6/a   |   -
a,b,c,f,e    |   -   |   -   |   -   | 10/c  |   -   |   -
a,b,c,f,e,d  |   -   |   -   |   -   |   -   |   -   |   -

• What would happen if we start on b?
• It is possible to end up with a different tree.
• How many different trees can we have?
• Ties can also influence the final tree.
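The shortest-path computation traced above can be sketched in the same style as Prim's. This is a minimal sketch with lazy deletion in place of DecreaseKey; the example graph is the same hypothetical adjacency map used for Prim's algorithm.

```python
import heapq

def dijkstra(graph, start):
    """Lazy-deletion Dijkstra; returns {node: (distance, predecessor)}."""
    dist = {start: (0, None)}
    pq = [(0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u][0]:        # stale entry: u was finalized at a lower cost
            continue
        for v, w in graph[u].items():
            if v not in dist or d + w < dist[v][0]:
                dist[v] = (d + w, u)   # relax edge (u, v)
                heapq.heappush(pq, (d + w, v))
    return dist

# Hypothetical example graph (undirected, symmetric adjacency map).
graph = {
    'a': {'b': 3, 'e': 6, 'f': 5},
    'b': {'a': 3, 'c': 1, 'f': 4},
    'c': {'b': 1, 'd': 6},
    'd': {'c': 6, 'f': 5},
    'e': {'a': 6, 'f': 2},
    'f': {'a': 5, 'b': 4, 'd': 5, 'e': 2},
}
dist = dijkstra(graph, 'a')
```

The only difference from Prim's loop is the key stored in the queue: the total distance from the start node rather than the weight of a single edge.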
Data compression
• From an information-theoretic point of view, most computer files contain much redundancy.
• Compression is used to store files in less space.
• For text files, savings of up to 50% are common.
• For binary files, savings of up to 90% are common.
• Savings in space mean savings in time for file transmission.
Run-Length Encoding
• For a text with long runs of repeated characters, we could compress by counting the runs. For example:
AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD • can then be encoded as:
4A3BAA5B8CDABCB3A4B3CD
• This is not useful for normal text. However, for binary files it can be very effective.
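The counting step can be sketched in a few lines. This is a minimal sketch, and one detail is an assumption: the slide does not state a cutoff, so runs shorter than three characters are copied through unchanged here, which reproduces the slide's output.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of 3 or more identical characters by
    count+character; shorter runs are copied through unchanged."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        out.append(f"{n}{ch}" if n >= 3 else ch * n)
    return "".join(out)
```

For example, `rle_encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD")` yields the slide's encoding `"4A3BAA5B8CDABCB3A4B3CD"`. For binary files the runs are much longer, so only run lengths need to be stored.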
Run-Length Encoding
000000000000000000000000000011111111111111000000000 000000000000000000000000001111111111111111110000000 000000000000000000000001111111111111111111111110000 000000000000000000000011111111111111111111111111000 000000000000000000001111111111111111111111111111110 000000000000000000011111110000000000000000001111111 000000000000000000011111000000000000000000000011111 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 000000000000000000011100000000000000000000000000111 000000000000000000001111000000000000000000000001110 000000000000000000000011100000000000000000000111000 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011111111111111111111111111111111111111111111111111 011000000000000000000000000000000000000000000000011
Variable-Length Encoding
• Fixed-length encoding uses a fixed number of bits to represent each character.
• For example, the ASCII code uses 8 bits per character.
• Variable-length encoding assigns shorter codes to common characters.
• In English, the most common character is E; hence, we could assign 0 to it.
• However, no other character code can start with 0.
• That is, no character’s code should be a prefix of some other character’s code (unless we somehow put separators between characters, which would take up space).
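The prefix property can be checked mechanically. A small sketch (not from the lecture): after sorting, if any code is a prefix of another, some adjacent pair in the sorted order exhibits the prefix relation, so only neighbours need comparing.

```python
def is_prefix_free(codes):
    """True iff no code in the collection is a prefix of another.
    'codes' is an iterable of bit strings such as '010'."""
    codes = sorted(codes)
    # A prefix sorts before its extensions, and anything lexicographically
    # between a string and its extension also starts with that string,
    # so checking adjacent pairs suffices.
    return all(not b.startswith(a) for a, b in zip(codes, codes[1:]))
```

For example, `is_prefix_free(["0", "01"])` is False, while the seven codes in the table below form a prefix-free set.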
Variable-Length Encoding
• Suppose our alphabet is {A,B,C,D,E,F,G}
• We analyzed a text and found the following number of occurrences
• The last column shows some sensible codes that we may use for each symbol
• Symbols with higher occurrence have shorter codes
SYMBOL | OCCURRENCE | CODE
A      | 28         | 11
B      | 4          | 0000
C      | 14         | 011
D      | 5          | 0001
E      | 27         | 10
F      | 12         | 010
G      | 10         | 001
Tries for Variable-Length Encoding
• A trie is a binary tree used in search applications.
• To search for a key we look at its individual bits, descending to the left whenever a bit is zero and to the right whenever it is one.
• Using a trie to determine codes means that no code will be a prefix of another.
Encoding messages
• To encode a message, we just need to concatenate the codes. For example:
FACE 010 11 011 10
BAGGED 0000 11 001 001 10 0001
• If we were to assign three bits per character, FACE would use 12 bits instead of 10. For BAGGED there are no space savings (18 bits either way).
SYMBOL | CODE
A      | 11
B      | 0000
C      | 011
D      | 0001
E      | 10
F      | 010
G      | 001
Decoding messages
• Try to decode 00011001111010 and 000011000100110 using the trie
• Starting from the root, print each symbol found as a leaf
• Repeat until the string is completed
• Remember the rules: Left branch is 0, right branch is 1
• DECAF / BADGE
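The trie walk can also be done directly on the code table, without building the tree: scan the bits and emit a symbol whenever the buffer matches a code, which is unambiguous precisely because the codes are prefix-free. A minimal sketch using the table from the slides:

```python
# Code table from the slides.
CODES = {'A': '11', 'B': '0000', 'C': '011', 'D': '0001',
         'E': '10', 'F': '010', 'G': '001'}

def encode(text):
    """Concatenate the code of each character."""
    return ''.join(CODES[ch] for ch in text)

def decode(bits):
    """Scan the bit string, emitting a symbol whenever the buffer
    matches a code; prefix-freeness makes each match unambiguous."""
    inverse = {code: ch for ch, code in CODES.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)
```

This reproduces the exercise: `decode("00011001111010")` gives `"DECAF"` and `decode("000011000100110")` gives `"BADGE"`.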
Huffman Encoding: Choosing the Codes
• Sometimes (for example for common English text) we may know the frequencies of letters fairly well.
• If we don’t know about frequencies then we can still count all characters in the given text as a first step.
• But how do we assign codes to the characters once we know their frequencies?
• By repeatedly selecting the two smallest weights and fusing them.
• This is Huffman’s algorithm – another example of a greedy method.
• The resulting tree is a Huffman tree.
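The fusing step can be sketched with a min-heap. This is a minimal sketch, assuming the input is a map from symbols to weights; with the occurrence counts from the earlier table it happens to reproduce that table's codes (tie-breaking aside).

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Repeatedly fuse the two lightest trees (Huffman's greedy step),
    then read codes off the final tree: left edge = 0, right edge = 1."""
    tie = count()  # tie-breaker so the heap never compares trees directly
    heap = [(w, next(tie), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # lightest tree
        w2, _, t2 = heapq.heappop(heap)   # second lightest
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse both ways
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:                             # leaf: record the symbol's code
            codes[tree] = prefix or '0'
    walk(heap[0][2], '')
    return codes

codes = huffman_codes({'A': 28, 'B': 4, 'C': 14, 'D': 5,
                       'E': 27, 'F': 12, 'G': 10})
```

Because each fusion removes two trees and adds one, there are exactly n-1 fusions for n symbols, and with a heap each costs O(log n).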
Huffman Trees (example)
• Using the occurrence counts from the table (B:4, D:5, G:10, F:12, C:14, E:27, A:28), the algorithm repeatedly fuses the two lightest trees:
• 4+5 = 9, then 9+10 = 19, then 12+14 = 26, then 19+26 = 45, then 27+28 = 55, and finally 45+55 = 100.
• [Figure: the Huffman tree built up step by step; leaves B, D, G, F, C, E, A carry the occurrence counts, internal nodes the fused weights 9, 19, 26, 45, 55.]
An exercise
• Construct the Huffman code for the data in the table, placing symbols in the tree from left to right [A,B,C,D,_]
• Then, encode ABACABAD and decode 100010111001010
• 0100011101000101 / BAD_ADA
SYMBOL | FREQUENCY | CODE
A      | 0.40      | 0
B      | 0.10      | 100
C      | 0.20      | 111
D      | 0.15      | 101
_      | 0.15      | 110
Compressed Transmission
• If the compressed file is being sent from one party to another, the parties must agree about the codes used.
• For example, the trie can be sent along with the message.
• For long files this extra cost is negligible.
• Modern compression methods, like Lempel-Ziv, assign codes not to individual symbols but to sequences of symbols.
Next lecture
• We briefly discuss complexity theory, NP-completeness and approximation algorithms
• In the final week we will devote time to reviewing all the content