ruby代写: COMP2150 Assignment 3: Huffman encoding

COMP2150 Assignment 3 Summer 2018

Due:

Friday, August 10, before 11:59 pm. Notes

Please follow both the “Programming Standards” and “Assignment Guidelines” for all work you submit. Programming standards 1-25 are in effect.
Hand-in will be via the UM Learn Assignments facility. Make sure you leave enough time before the deadline to ensure your hand-in works properly.
Official test data will be provided several days before the assignment due date. Until then, work with your own test data.
All assignments in this course carry an equal weight. Assignment 3: Huffman encoding
Structure of the Assignment

I have structured this assignment so you can work on it in stages. The stages are fairly independent, but form part of the overall assignment. I will provide some free code to get you started, which includes test programs. Work on stages and test them with the provided programs. Even if you don’t get the whole program working, you can still get good marks by completing stages.

You will write several modules which implement several data structures. Then you will use those data structures to solve the problem of Huffman encoding of text strings. The marks for the modules will be assigned for successfully running the test programs.

Huffman encoding

Huffman encoding is an interesting application of binary trees. Huffman’s algorithm allows us to produce a minimum data encoding scheme based on the frequencies of characters in a given text dataset. To use Huffman’s algorithm, you first need to read a file of text and produce a list of each individual character in the text and the number of times that character occurred. For this purpose we will create a data structure called an Occurrence List. Once you have these frequencies, you need to gradually build an encoding tree using a Priority Queue (the second module). We begin by making a tree from each of the pieces of frequency data that we just tallied in the file (i.e. many trees with one node in each) and inserting each of these trees in the priority queue (priority in this priority queue is by frequency). After everything is in (there will be a significant size to this queue, as the data file will contain small and capital letters, digits, and special characters), we proceed to build a Huffman tree as follows: repeatedly take the two LOWEST priority items from the queue (so lowest frequencies here – if two items have the same frequency, ensure that the one that is the lowest in the ASCII collating sequence is removed from the priority queue first), build a new tree node with data consisting of the concatenated string of both of those items, with a frequency equal to the total of those two items, and with the two subtrees themselves as left and right children of the node (make the left one the smaller priority

1

COMP2150 Assignment 3 Summer 2018

(lowest frequency) of the two subtrees, again with collating sequence breaking a tie). Then insert that new tree in the priority queue. If we keep doing this, we will eventually end up with a single node (tree) in the priority queue, whose data value consists of the all the string data items in the original frequencies list concatenated together. All the other original trees will be somewhere below this tree – and the more infrequent they were, the deeper they’ll be.

So now we have a Huffman tree. How does this help us encode? Well, since it’s a binary tree (our third module), we can do binary encoding by moving left or right. Say we have a character we want to encode. Start at the top of the tree and look for it (at any point I can tell which to follow because I can look at the strings of data in the left and right subtrees and see which one contains the character). Let us say that if I have to move right to get to the part of the tree the character I am encoding is in I will record a 1, and if I have to move left I will record a 0. As I trace down the tree I’ll record a series of 1’s and 0’s, and will eventually find the data I’m looking for (assuming it was in my original data file!). At that point, the binary digits that were recorded form the encoding for that character (see the example in the Notes) – and the encoding is guaranteed to be as short as possible since all the frequently used characters have the shortest strings (shortest path through the tree). While normal encoding would be REAL binary numbers, of course, we’ll just use strings of 0’s and 1’s.

Modules

Module MyOrderedItem defines an abstract class called OrderedItem. Use the techniques discussed in the notes to define an ‘abstract’ class in ruby. This abstract class is used to ensure sub-classes define two functions: compareTo and to_s.

Module MyOccurrenceList defines a data structure called OccurrrenceList. As in the previous assignments, all your data structures are linked structures that you create on your own. The nodes contain an item and a count, as well as a link to the next node. Provide an increment method in the Node class so that the count in a node may be easily incremented by one.
The occurrence list itself should support the add method, which either adds a new item to the head of the list (with a count of 1), or increments the count of the existing item. You may want to write a find method to support this operation, but you don’t have to. Also write the to_s method, which returns a string containing the contents of the list (items and counts), one item per line.

Module MyPriorityQueue defines a data structure called PriorityQueue. Since a priority queue must have some means of ordering item by priority, ensure that only OrderedItem objects are added to the data structure (use the kind_of? method to check that items added to the priority queue are OrderedItems). Your data structure must support an insert method, a get_first method and an empty? method.

Module MyBST defines a data structure called BST (a binary search tree). This structure also must be able to order items, but use Duck Typing this time to ensure that items can respond to the compareTo message (use respond_to?). Your BST will use the simple insertion technique of simply following a path from the root to an empty subtree, and

COMP2150 Assignment 3 Summer 2018

inserting a new node at that point (we are not concerned about making a balanced tree). As you have probably done in Java, have a public insert method, which calls a private, recursive insert to actually perform the insertion. Provide a to_s method that performs an inorder traversal of the tree (again using a private, recursive method) and returns the string.

Since we are inserting BST objects into a PriorityQueue when performing Huffman encoding, the BST class must extend OrderedItem. Thus, you must also provide a compareTo method (compare the items in the roots of the trees).

Main program

The main program will read input from a file (this is included in the free code). It then produces a Huffman Tree, and uses that to produce an encoding of the original input. The encoding is actually a sequence of ‘0’ and ‘1’ characters, NOT a binary file. It then prints the original string and the ‘encoded’ version.

Output

See the notes on Huffman encoding for an example of Huffman encoding.

Hand-in

Submit all your source code. Include a README explaining how to run your program.

You should also submit a text document containing a short English description of the interactions between objects in your project. This should answer questions like “Which class is initially called?”, “Which method(s) process the input?”, and which objects deal with tasks in the code. This description should be at most one page of text.

Submit all files on UM Learn.

Related Posts