COMP90038 Algorithms and Complexity
Lecture 21: Huffman Encoding for Data Compression
(with thanks to Harald Søndergaard & Michael Kirley)
Andres Munoz-Acosta
munoz.m@unimelb.edu.au
Peter Hall Building G.83
• The SES is open now. Please take the time to review this subject. All
feedback is greatly appreciated.
• The final exam has been scheduled for 1:15PM on Thursday 8th of
November at Wilson Hall
• It is a closed book exam. No calculators are permitted.
• Reading time will be 15 minutes. Exam duration will be 3 hours.
• Answers must be provided in the exam paper in the space allocated.
• The reverse side of the pages can be used for rough work.
• All questions should be attempted. Some are easier than others.
• Any unreadable parts will be considered wrong. Be neat in your answers.
• A sample exam paper will be provided this week.
• We have instructed the tutors NOT to provide hints on the sample exam.
• Assignment 2 is due next Sunday at 11:59PM.
• We will provide sample answers during the lecture and through the LMS.
• Next week we will use both lectures for a quick review of the content.
• Only examinable topics will be discussed in the review.
Recap
• We discussed greedy algorithms:
• A problem-solving strategy that makes the locally best choice among all
feasible ones. Such a choice is irrevocable.
• Usually, locally best choices do not yield globally best results.
• In some exceptional cases, however, a greedy algorithm is both correct and fast.
• Also, a greedy algorithm can provide good approximations.
• We applied this idea to two graph problems:
• Prim’s algorithm for finding minimum spanning trees
• Dijkstra’s algorithm for single-source shortest path
What is a Minimum Spanning Tree?
• A minimum spanning tree of a weighted graph (V, E) is a spanning tree
(V, E'), with E' ⊆ E, whose total edge weight is as small as possible.
• We use Prim's algorithm to find a minimum spanning tree.
• It constructs a sequence of subtrees T, at each step adding to the current tree
the closest node not yet in it.
[Figure: a weighted graph on vertices a–f, shown twice: the original graph and
the graph with its minimum spanning tree highlighted. The edge weights did not
survive conversion legibly.]
Prim’s Algorithm
• We examined the complete algorithm, which uses a priority queue:
[Figure: the weighted graph on vertices a–f used in the worked trace below.]
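As a refresher, the lazy priority-queue version of Prim's algorithm can be sketched in Python. The adjacency list below is reconstructed from the worked trace on the next slide (the original figure did not survive conversion), so treat the edge set as an assumption:

```python
import heapq

def prim(graph, start):
    """Grow an MST from start, always taking the cheapest edge that
    connects a new vertex to the tree (lazy deletion from the heap)."""
    in_tree = set()
    mst_edges = []
    pq = [(0, start, None)]            # (edge weight, vertex, parent)
    while pq:
        w, v, parent = heapq.heappop(pq)
        if v in in_tree:
            continue                   # stale entry: vertex already in the tree
        in_tree.add(v)
        if parent is not None:
            mst_edges.append((parent, v, w))
        for u, wu in graph[v]:
            if u not in in_tree:
                heapq.heappush(pq, (wu, u, v))
    return mst_edges

# Edge set reconstructed from the trace; the slide's figure is assumed lost
graph = {
    'a': [('b', 3), ('e', 6), ('f', 5)],
    'b': [('a', 3), ('c', 1), ('f', 4)],
    'c': [('b', 1), ('d', 6)],
    'd': [('c', 6), ('f', 5)],
    'e': [('a', 6), ('f', 2)],
    'f': [('a', 5), ('b', 4), ('d', 5), ('e', 2)],
}
edges = prim(graph, 'a')
print(sum(w for _, _, w in edges))     # total MST weight: 15
```

Starting from `a`, the edges come out in the same order as the trace: (a,b), (b,c), (b,f), (f,e), (f,d).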
Another example
• Let’s work with the following graph:
• What would happen if we start on b?
• The sequence will be different, but the edges may be
the same
• How many different trees can we have?
• If there are ties, the tie-breaking rule influences the result
Tree T               a     b     c     d     e     f
-             cost   ∞     ∞     ∞     ∞     ∞     ∞
              prev   nil   nil   nil   nil   nil   nil
-             cost   0     ∞     ∞     ∞     ∞     ∞
              prev   nil   nil   nil   nil   nil   nil
a             cost   -     3     ∞     ∞     6     5
              prev         a     nil   nil   a     a
a,b           cost         -     1     ∞     6     4
              prev               b     nil   a     b
a,b,c         cost               -     6     6     4
              prev                     c     a     b
a,b,c,f       cost                     5     2     -
              prev                     f     f
a,b,c,f,e     cost                     5     -
              prev                     f
a,b,c,f,e,d   (all vertices added: the tree is complete)
Dijkstra’s Algorithm
• Dijkstra’s algorithm finds all shortest paths from a fixed start node.
Its complexity is the same as that of Prim’s algorithm.
[Figure: the weighted graph on vertices a–f used in the worked trace below.]
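Dijkstra's algorithm differs from Prim's only in what the priority queue orders by: the full distance from the start node rather than a single edge weight. A sketch, using the same adjacency list reconstructed from the worked trace (the original figure is assumed lost):

```python
import heapq

def dijkstra(graph, start):
    """Single-source shortest paths. Same structure as Prim's algorithm,
    but the priority of a vertex is its distance from the start."""
    dist = {start: 0}
    final = set()
    pq = [(0, start)]
    while pq:
        d, v = heapq.heappop(pq)
        if v in final:
            continue                   # stale entry
        final.add(v)
        for u, w in graph[v]:
            if u not in final and d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(pq, (d + w, u))
    return dist

# Edge set reconstructed from the trace (an assumption, as above)
graph = {
    'a': [('b', 3), ('e', 6), ('f', 5)],
    'b': [('a', 3), ('c', 1), ('f', 4)],
    'c': [('b', 1), ('d', 6)],
    'd': [('c', 6), ('f', 5)],
    'e': [('a', 6), ('f', 2)],
    'f': [('a', 5), ('b', 4), ('d', 5), ('e', 2)],
}
dist = dijkstra(graph, 'a')
print(dist)    # matches the final column of the trace: d ends up at cost 10
```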
Another example
• Let’s work with this graph again:
• What would happen if we start on b?
• It is possible to end up with a different tree
• How many different trees can we have?
• Ties can also influence the final tree.
Tree T               a     b     c     d     e     f
-             cost   ∞     ∞     ∞     ∞     ∞     ∞
              prev   nil   nil   nil   nil   nil   nil
-             cost   0     ∞     ∞     ∞     ∞     ∞
              prev   nil   nil   nil   nil   nil   nil
a             cost   -     3     ∞     ∞     6     5
              prev         a     nil   nil   a     a
a,b           cost         -     4     ∞     6     5
              prev               b     nil   a     a
a,b,c         cost               -     10    6     5
              prev                     c     a     a
a,b,c,f       cost                     10    6     -
              prev                     c     a
a,b,c,f,e     cost                     10    -
              prev                     c
a,b,c,f,e,d   (all vertices finalized: shortest paths complete)
Data compression
• From an information-theoretic point of view, most computer files
contain much redundancy.
• Compression is used to store files in less space.
• For text files, savings of up to 50% are common.
• For binary files, savings of up to 90% are common.
• Savings in space mean savings in time for file transmission.
Run-Length Encoding
• For a text with long runs of repeated characters, we could compress by
counting the runs. For example:
AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD
• can then be encoded as:
4A3BAA5B8CDABCB3A4B3CD
• This is not useful for normal text. However, for binary files it can be very
effective.
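The scheme in the example replaces runs of three or more identical characters by a count followed by the character, and copies shorter runs verbatim (so `AA` stays `AA`). A minimal sketch:

```python
from itertools import groupby

def rle_encode(s):
    """Runs of length >= 3 become '<count><char>'; shorter runs are
    copied verbatim, matching the slide's example."""
    out = []
    for ch, group in groupby(s):
        n = len(list(group))
        out.append(f"{n}{ch}" if n >= 3 else ch * n)
    return ''.join(out)

print(rle_encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD"))
# -> 4A3BAA5B8CDABCB3A4B3CD
```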
Run-Length Encoding
000000000000000000000000000011111111111111000000000
000000000000000000000000001111111111111111110000000
000000000000000000000001111111111111111111111110000
000000000000000000000011111111111111111111111111000
000000000000000000001111111111111111111111111111110
000000000000000000011111110000000000000000001111111
000000000000000000011111000000000000000000000011111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000001111000000000000000000000001110
000000000000000000000011100000000000000000000111000
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011000000000000000000000000000000000000000000000011
Variable-Length Encoding
• Fixed-length encoding uses the same number of bits to represent each
character.
• For example, the ASCII code uses 8 bits per character.
• Variable-length encoding assigns shorter codes to more common characters.
• In English, the most common character is E, hence, we could assign 0 to it.
• However, no other character code can start with 0.
• That is, no character’s code should be a prefix of some other character’s
code (unless we somehow put separators between characters, which
would take up space).
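The prefix property can be checked mechanically. A small helper (not from the slides): in lexicographically sorted order, a codeword that is a prefix of another must also be a prefix of its immediate successor, so comparing adjacent pairs suffices.

```python
def is_prefix_free(codes):
    """True iff no codeword is a prefix of another, so concatenated
    codewords can be decoded without separators."""
    words = sorted(codes)
    return all(not nxt.startswith(prev) for prev, nxt in zip(words, words[1:]))

# The codes from the table above
print(is_prefix_free(['11', '0000', '011', '0001', '10', '010', '001']))  # True
print(is_prefix_free(['0', '01']))   # False: '0' is a prefix of '01'
```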
Variable-Length Encoding
• Suppose our alphabet is
{A,B,C,D,E,F,G}
• We analyzed a text and found the
following number of occurrences
• The last column shows some
sensible codes that we may use for
each symbol
• Symbols with higher occurrence have
shorter codes
SYMBOL OCCURRENCE CODE
A 28 11
B 4 0000
C 14 011
D 5 0001
E 27 10
F 12 010
G 10 001
Tries for Variable-Length Encoding
• A trie is a binary tree used in search applications.
• To search for a key, we look at individual bits of the key, descending to the left whenever a
bit is zero and to the right whenever it is one.
• Using a trie to determine codes means that no code will be the prefix of another.
Encoding messages
• To encode a message, we just need to concatenate the
codes. For example:
• If we were to assign three bits per character, FACE would
use 12 bits instead of 10. For BAGGED there is no space
saving.
F A C E
010 11 011 10
B A G G E D
0000 11 001 001 10 0001
SYMBOL CODE
A 11
B 0000
C 011
D 0001
E 10
F 010
G 001
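Given the code table, encoding is just a lookup and concatenation. A minimal sketch using the table above:

```python
# The code table from the slide
CODES = {'A': '11', 'B': '0000', 'C': '011', 'D': '0001',
         'E': '10', 'F': '010', 'G': '001'}

def encode(msg, codes=CODES):
    # Variable-length encoding: concatenate the codewords, no separators needed
    return ''.join(codes[ch] for ch in msg)

print(encode("FACE"))     # 0101101110 -- 10 bits, vs 12 at 3 bits/char
print(len(encode("BAGGED")))  # 18 bits, same as 6 chars * 3 bits
```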
Decoding messages
• Try to decode 00011001111010
and 000011000100110 using the
trie
• Starting from the root, print each
symbol found as a leaf
• Repeat until the string is
completed
• Remember the rules: Left branch
is 0, right branch is 1
• DECAF / BADGE
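The trie walk described above can be sketched directly, with the trie as nested dicts (a representation chosen here for brevity, not taken from the slides):

```python
def build_trie(codes):
    """Insert each codeword bit by bit; a leaf stores its symbol."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node['sym'] = sym
    return root

def decode(bits, codes):
    """Walk left on 0 and right on 1; on reaching a leaf, emit its
    symbol and restart from the root."""
    root = build_trie(codes)
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if 'sym' in node:
            out.append(node['sym'])
            node = root
    return ''.join(out)

CODES = {'A': '11', 'B': '0000', 'C': '011', 'D': '0001',
         'E': '10', 'F': '010', 'G': '001'}
print(decode("00011001111010", CODES))    # DECAF
print(decode("000011000100110", CODES))   # BADGE
```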
Huffman Encoding: Choosing the Codes
• Sometimes (for example for common English text) we may know the
frequencies of letters fairly well.
• If we don’t know about frequencies then we can still count all characters in
the given text as a first step.
• But how do we assign codes to the characters once we know their
frequencies?
• By repeatedly selecting the two smallest weights and fusing them.
• This is Huffman’s algorithm – another example of a greedy method.
• The resulting tree is a Huffman tree.
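The fuse-the-two-smallest step maps naturally onto a binary heap. A sketch in Python, representing each partial tree simply by its symbol-to-code map; with the occurrence counts from the earlier table it reproduces the code lengths shown there (the exact bit patterns can vary with tie-breaking):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Huffman's greedy algorithm: repeatedly fuse the two lightest
    trees, prefixing '0' to one side's codes and '1' to the other's."""
    tiebreak = count()      # unique sequence number so the heap never compares dicts
    heap = [(w, next(tiebreak), {sym: ''}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Occurrence counts from the earlier table
freqs = {'A': 28, 'B': 4, 'C': 14, 'D': 5, 'E': 27, 'F': 12, 'G': 10}
codes = huffman_codes(freqs)
# Expected lengths: A, E -> 2 bits; C, F, G -> 3 bits; B, D -> 4 bits
```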
Huffman Trees (example)
[Figure: the Huffman tree built from the table above. Leaves (weight): B(4),
D(5), G(10), F(12), C(14), E(27), A(28). Internal nodes: 4+5=9, 9+10=19,
12+14=26, 19+26=45, 27+28=55; root 45+55=100.]
An exercise
• Construct the Huffman code for the data in the table,
placing symbols in the tree from left to right [A,B,C,D,_]
• Then, encode ABACABAD and decode
100010111001010
• 0100011101000101 / BAD_ADA
SYMBOL FREQUENCY CODE
A 0.40 0
B 0.10 100
C 0.20 111
D 0.15 101
_ 0.15 110
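The exercise's answers can be checked mechanically with the code table from the table above (a prefix-free code decodes by matching from the front of the bit string):

```python
# The code table from the exercise
CODES = {'A': '0', 'B': '100', 'C': '111', 'D': '101', '_': '110'}
INVERSE = {c: s for s, c in CODES.items()}

def encode(msg):
    return ''.join(CODES[ch] for ch in msg)

def decode(bits):
    # Because the code is prefix-free, the first buffered match is the symbol
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in INVERSE:
            out.append(INVERSE[buf])
            buf = ''
    return ''.join(out)

print(encode("ABACABAD"))         # 0100011101000101
print(decode("100010111001010"))  # BAD_ADA
```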
Compressed Transmission
• If the compressed file is being sent from one party to another, the
parties must agree about the codes used.
• For example, the trie can be sent along with the message.
• For long files this extra cost is negligible.
• Modern compression schemes, such as Lempel-Ziv, assign codes not to
individual symbols but to sequences of symbols.
Next lecture
• We briefly discuss complexity theory, NP-completeness and
approximation algorithms
• In the final week we will devote time to reviewing all the content