COMP90038 Algorithms and Complexity

Lecture 21: Huffman Encoding for Data Compression

(with thanks to Harald Søndergaard & Michael Kirley)

Andres Munoz-Acosta

munoz.m@unimelb.edu.au

Peter Hall Building G.83


• The SES is open now. Please take time to review this subject. All
feedback is greatly appreciated.

• The final exam has been scheduled for 1:15PM on Thursday 8th of
November at Wilson Hall.
• It is a closed book exam. No calculators are permitted.
• Reading time will be 15 minutes. Exam duration will be 3 hours.
• Answers must be provided in the exam paper in the space allocated.
• The reverse side of the pages can be used for rough work.
• All questions should be attempted. Some are easier than others.
• Any unreadable parts will be considered wrong. Be neat in your answers.

• A sample exam paper will be provided this week.
• We have instructed the tutors NOT to provide hints on the sample exam.

• Assignment 2 is due next Sunday at 11:59PM.
• We will provide sample answers during the lecture and through the LMS.

• Next week we will use both lectures for a quick review of the content.
• Only examinable topics will be discussed in the review.

Recap

• We discussed greedy algorithms:
• A problem-solving strategy that makes the locally best choice among all
feasible ones. The choice is irrevocable.

• Usually, locally best choices do not yield globally best results.

• In some exceptional cases, a greedy algorithm is both correct and fast.

• In other cases, a greedy algorithm can still provide good approximations.

• We applied this idea to two graph problems:
• Prim’s algorithm for finding minimum spanning trees

• Dijkstra’s algorithm for single-source shortest path

What is a Minimum Spanning Tree?

• A minimum spanning tree of a weighted graph ⟨V, E⟩ is a tree ⟨V, E′⟩, where
E′ ⊆ E, whose total edge weight is as small as possible.

• We use Prim’s algorithm to find the minimum spanning tree.
• It constructs a sequence of subtrees T, each obtained by adding to the
previous tree the closest node not yet in it.

[Figure: an example weighted graph on vertices a to f, shown before and after Prim's algorithm highlights the minimum spanning tree edges.]

Prim’s Algorithm

• We examined the complete algorithm, which uses a priority queue:

[Figure: an example weighted graph on vertices a to f, traced with the priority-queue version of Prim's algorithm.]

Another example

• Let’s work with the following graph:

• What would happen if we start on b?
• The sequence will be different, but the edges may be the same.

• How many different trees can we have?
• If there are ties, the tie-breaking has an influence.

Tree T        |      |  a  |  b  |  c  |  d  |  e  |  f
{}            | cost |  ∞  |  ∞  |  ∞  |  ∞  |  ∞  |  ∞
              | prev | nil | nil | nil | nil | nil | nil
{}            | cost |  0  |  ∞  |  ∞  |  ∞  |  ∞  |  ∞
              | prev | nil | nil | nil | nil | nil | nil
{a}           | cost |  -  |  3  |  ∞  |  ∞  |  6  |  5
              | prev |     |  a  | nil | nil |  a  |  a
{a,b}         | cost |  -  |  -  |  1  |  ∞  |  6  |  4
              | prev |     |     |  b  | nil |  a  |  b
{a,b,c}       | cost |  -  |  -  |  -  |  6  |  6  |  4
              | prev |     |     |     |  c  |  a  |  b
{a,b,c,f}     | cost |  -  |  -  |  -  |  5  |  -  |  2
              | prev |     |     |     |  f  |     |  f
{a,b,c,f,e}   | cost |  -  |  -  |  -  |  5  |  -  |  -
              | prev |     |     |     |  f  |     |
{a,b,c,f,e,d} | all nodes included; MST edges: ab, bc, bf, fe, fd
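The trace above can be reproduced in code. Below is a minimal sketch of the "lazy" heap-based version of Prim's algorithm, using Python's heapq; the edge weights are reconstructed from the trace table, which is an assumption since the original figure did not survive.

```python
import heapq

def prim(graph, start):
    """Lazy Prim's algorithm: repeatedly pull the cheapest edge
    leaving the current tree from a min-priority queue."""
    mst = []                       # list of (u, v, weight) tree edges
    visited = {start}
    # Seed the queue with the edges leaving the start node.
    pq = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(pq)
    while pq and len(visited) < len(graph):
        w, u, v = heapq.heappop(pq)
        if v in visited:
            continue               # stale entry: v joined the tree already
        visited.add(v)
        mst.append((u, v, w))
        for nxt, w2 in graph[v].items():
            if nxt not in visited:
                heapq.heappush(pq, (w2, v, nxt))
    return mst

# Edge weights reconstructed from the trace table (an assumption).
edges = [('a', 'b', 3), ('a', 'e', 6), ('a', 'f', 5), ('b', 'c', 1),
         ('b', 'f', 4), ('c', 'd', 6), ('f', 'd', 5), ('f', 'e', 2)]
graph = {v: {} for e in edges for v in e[:2]}
for u, v, w in edges:
    graph[u][v] = w
    graph[v][u] = w

tree = prim(graph, 'a')
print(sum(w for _, _, w in tree))  # total MST weight
```

Each edge may be pushed more than once; stale entries are simply skipped when popped, which is why this variant is called "lazy".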

Dijkstra’s Algorithm

• Dijkstra’s algorithm finds all shortest paths from a fixed start node.
Its complexity is the same as that of Prim’s algorithm.

[Figure: the same example graph on vertices a to f, traced with Dijkstra's algorithm.]

Another example

• Let’s work with this graph again:

• What would happen if we start on b?
• It is possible to end up with a different tree.

• How many different trees can we have?
• Ties can also influence the final tree.

Tree T        |      |  a  |  b  |  c  |  d  |  e  |  f
{}            | cost |  ∞  |  ∞  |  ∞  |  ∞  |  ∞  |  ∞
              | prev | nil | nil | nil | nil | nil | nil
{}            | cost |  0  |  ∞  |  ∞  |  ∞  |  ∞  |  ∞
              | prev | nil | nil | nil | nil | nil | nil
{a}           | cost |  -  |  3  |  ∞  |  ∞  |  6  |  5
              | prev |     |  a  | nil | nil |  a  |  a
{a,b}         | cost |  -  |  -  |  4  |  ∞  |  6  |  5
              | prev |     |     |  b  | nil |  a  |  a
{a,b,c}       | cost |  -  |  -  |  -  | 10  |  6  |  5
              | prev |     |     |     |  c  |  a  |  a
{a,b,c,f}     | cost |  -  |  -  |  -  | 10  |  6  |  -
              | prev |     |     |     |  c  |  a  |
{a,b,c,f,e}   | cost |  -  |  -  |  -  | 10  |  -  |  -
              | prev |     |     |     |  c  |     |
{a,b,c,f,e,d} | final distances: a = 0, b = 3, c = 4, d = 10, e = 6, f = 5
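The Dijkstra trace can be checked the same way. A minimal heap-based sketch, again using the edge weights reconstructed from the trace tables (an assumption):

```python
import heapq

def dijkstra(graph, start):
    """Dijkstra's algorithm with a min-priority queue keyed on the
    tentative distance from the start node."""
    dist = {start: 0}
    pq = [(0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue               # stale entry: u already finalised
        for v, w in graph[u].items():
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w    # relax the edge u -> v
                heapq.heappush(pq, (d + w, v))
    return dist

# Same reconstructed edge weights as in the Prim trace (an assumption).
edges = [('a', 'b', 3), ('a', 'e', 6), ('a', 'f', 5), ('b', 'c', 1),
         ('b', 'f', 4), ('c', 'd', 6), ('f', 'd', 5), ('f', 'e', 2)]
graph = {v: {} for e in edges for v in e[:2]}
for u, v, w in edges:
    graph[u][v] = w
    graph[v][u] = w

print(dijkstra(graph, 'a'))
```

Note the structural similarity with Prim's algorithm: only the priority changes, from edge weight to total distance from the start.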

Data compression

• From an information-theoretic point of view, most computer files
contain much redundancy.

• Compression is used to store files in less space.
• For text files, savings of up to 50% are common.

• For binary files, savings of up to 90% are common.

• Savings in space mean savings in time for file transmission.

Run-Length Encoding

• For a text with long runs of repeated characters, we could compress by
counting the runs. For example:

AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD

• can then be encoded as:

4A3BAA5B8CDABCB3A4B3CD

• This is not useful for normal text. However, for binary files it can be very
effective.
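A minimal sketch of the encoder implied by the example. I assume runs shorter than three characters are copied verbatim (as in the slide, where AA and single characters stay unchanged) and that run counts fit in a single digit:

```python
def rle_encode(text):
    """Run-length encode: runs of 3 or more repeated characters become
    <count><char>; shorter runs are copied verbatim. (Counts are
    assumed to fit in a single digit, as in the lecture example.)"""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                 # scan to the end of the current run
        run = j - i
        if run >= 3:
            out.append(f"{run}{text[i]}")
        else:
            out.append(text[i] * run)
        i = j
    return ''.join(out)

print(rle_encode("AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD"))
```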

Run-Length Encoding

000000000000000000000000000011111111111111000000000

000000000000000000000000001111111111111111110000000

000000000000000000000001111111111111111111111110000

000000000000000000000011111111111111111111111111000

000000000000000000001111111111111111111111111111110

000000000000000000011111110000000000000000001111111

000000000000000000011111000000000000000000000011111

000000000000000000011100000000000000000000000000111

000000000000000000011100000000000000000000000000111

000000000000000000011100000000000000000000000000111

000000000000000000011100000000000000000000000000111

000000000000000000001111000000000000000000000001110

000000000000000000000011100000000000000000000111000

011111111111111111111111111111111111111111111111111

011111111111111111111111111111111111111111111111111

011111111111111111111111111111111111111111111111111

011111111111111111111111111111111111111111111111111

011000000000000000000000000000000000000000000000011

Variable-Length Encoding

• Fixed-length encoding uses a fixed number of bits to represent each
character.
• For example, ASCII text is commonly stored using 8 bits per character.

• Variable-length encoding assigns shorter codes to the most common characters.
• In English, the most common character is E; hence, we could assign the code 0 to it.
• However, no other character's code can then start with 0.

• That is, no character’s code should be a prefix of some other character’s
code (unless we somehow put separators between characters, which
would take up space).
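The prefix property can be checked mechanically: after sorting the codes, a prefix would appear immediately before some code that extends it. A small sketch (the helper name is my own):

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of another, so decoding
    needs no separators between characters."""
    cs = sorted(codes)
    # In sorted order, all codes extending cs[k] follow it directly,
    # so checking adjacent pairs suffices.
    return all(not b.startswith(a) for a, b in zip(cs, cs[1:]))

# The variable-length codes used later in the lecture.
lecture = ['11', '0000', '011', '0001', '10', '010', '001']
print(is_prefix_free(lecture))       # True
print(is_prefix_free(['0', '01']))   # False: '0' is a prefix of '01'
```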

Variable-Length Encoding

• Suppose our alphabet is {A,B,C,D,E,F,G}.

• We analyzed a text and found the following numbers of occurrences.

• The last column shows some sensible codes that we may use for each symbol.
• Symbols with higher occurrence have shorter codes.

SYMBOL | OCCURRENCE | CODE
A      | 28         | 11
B      | 4          | 0000
C      | 14         | 011
D      | 5          | 0001
E      | 27         | 10
F      | 12         | 010
G      | 10         | 001

Tries for Variable-Length Encoding

• A trie is a binary tree used in search applications.

• To search for a key, we look at its bits one at a time and descend to the left
whenever a bit is 0 and to the right whenever it is 1.

• Using a trie to determine the codes guarantees that no code is a prefix of another.

Encoding messages

• To encode a message, we just concatenate the codes of its
characters. For example:

• With a fixed three bits per character, FACE would use 12 bits
instead of 10. For BAGGED there are no space savings.

F    A   C    E
010  11  011  10

B     A   G    G    E   D
0000  11  001  001  10  0001

SYMBOL | CODE
A      | 11
B      | 0000
C      | 011
D      | 0001
E      | 10
F      | 010
G      | 001
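Encoding is just a table lookup plus concatenation. A minimal sketch using the code table above:

```python
# Code table from the slide.
CODES = {'A': '11', 'B': '0000', 'C': '011', 'D': '0001',
         'E': '10', 'F': '010', 'G': '001'}

def encode(message, codes=CODES):
    """Encode a message by concatenating the per-symbol codes."""
    return ''.join(codes[ch] for ch in message)

print(encode("FACE"))    # 10 bits, versus 12 with a fixed 3-bit code
print(encode("BAGGED"))  # 18 bits, the same as a fixed 3-bit code
```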

Decoding messages

• Try to decode 00011001111010 and 000011000100110 using the trie.
• Starting from the root, output each symbol found at a leaf, then
restart from the root.
• Repeat until the string is consumed.

• Remember the rules: the left branch is 0, the right branch is 1.

• Answers: DECAF / BADGE
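The decoding procedure described above can be sketched as follows, building the trie from the code table. The nested-dictionary representation of the trie is an illustrative choice:

```python
CODES = {'A': '11', 'B': '0000', 'C': '011', 'D': '0001',
         'E': '10', 'F': '010', 'G': '001'}

def build_trie(codes):
    """Build the decoding trie: each node is a dict {bit: child},
    and leaves carry the decoded symbol."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node['symbol'] = sym
    return root

def decode(bits, codes=CODES):
    """Walk the trie: left on 0, right on 1; output the symbol at
    each leaf and restart from the root."""
    root = build_trie(codes)
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if 'symbol' in node:       # reached a leaf
            out.append(node['symbol'])
            node = root
    return ''.join(out)

print(decode("00011001111010"))   # DECAF
print(decode("000011000100110"))  # BADGE
```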

Huffman Encoding: Choosing the Codes

• Sometimes (for example for common English text) we may know the
frequencies of letters fairly well.

• If we don’t know about frequencies then we can still count all characters in
the given text as a first step.

• But how do we assign codes to the characters once we know their
frequencies?
• By repeatedly selecting the two smallest weights and fusing them.

• This is Huffman’s algorithm – another example of a greedy method.
• The resulting tree is a Huffman tree.
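The greedy construction can be sketched as below, using the occurrence counts from the earlier table. Note that tie-breaking may yield different (but equally optimal) bit patterns than the slides show; what the algorithm guarantees are the code lengths and the minimal total encoded length:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman tree by repeatedly merging the two lightest
    subtrees, then read codes off the tree (0 = left, 1 = right)."""
    # Heap entries are (weight, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two smallest weights...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # ...fused
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'   # single-symbol edge case
    walk(heap[0][2], '')
    return codes

# Occurrence counts from the earlier table.
freqs = {'A': 28, 'B': 4, 'C': 14, 'D': 5, 'E': 27, 'F': 12, 'G': 10}
codes = huffman_codes(freqs)
print({s: len(c) for s, c in sorted(codes.items())})
```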

Huffman Trees (example)

[Figure: step-by-step construction of the Huffman tree for the frequencies in the earlier table. The leaves are B(4), D(5), G(10), F(12), C(14), E(27), A(28); successive merges create internal nodes 9 = 4+5, 19 = 9+10, 26 = 12+14, 45 = 19+26, 55 = 27+28, and the root 100.]


An exercise

• Construct the Huffman code for the data in the table, placing
symbols in the tree from left to right [A, B, C, D, _].

• Then, encode ABACABAD and decode 100010111001010.

• Answers: 0100011101000101 / BAD_ADA

SYMBOL | FREQUENCY | CODE
A      | 0.40      | 0
B      | 0.10      | 100
C      | 0.20      | 111
D      | 0.15      | 101
_      | 0.15      | 110
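The exercise's answers can be checked with a short sketch; here decoding uses a dictionary lookup on a growing bit buffer rather than an explicit trie, which works because the code is prefix-free:

```python
# Code table from the exercise.
CODES = {'A': '0', 'B': '100', 'C': '111', 'D': '101', '_': '110'}

def encode(msg):
    """Concatenate the per-symbol codes."""
    return ''.join(CODES[c] for c in msg)

def decode(bits):
    """Accumulate bits until they match a complete code, then emit
    the symbol and reset the buffer."""
    inv = {v: k for k, v in CODES.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ''
    return ''.join(out)

print(encode("ABACABAD"))         # 0100011101000101
print(decode("100010111001010"))  # BAD_ADA
```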

Compressed Transmission

• If the compressed file is being sent from one party to another, the
parties must agree about the codes used.
• For example, the trie can be sent along with the message.

• For long files this extra cost is negligible.

• Later compression schemes, such as Lempel-Ziv, assign codes not to
individual symbols but to sequences of symbols.

Next lecture

• We briefly discuss complexity theory, NP-completeness and
approximation algorithms

• In the final week we will devote time to reviewing all the content.