CS计算机代考程序代写 scheme data structure c/c++ algorithm ARIZONA STATE UNIVERSITY

ARIZONA STATE UNIVERSITY

CSE 310, SLN 91082 — Data Structures and Algorithms —
Project #1

Encoding and decoding schemes are used in a wide variety of applications, such as in music or video streaming,
data communications, storage systems (e.g., on CDs, DVDs, RAID arrays), among many others. In a fixed-
length encoding scheme each symbol is assigned a bit string of the same length. An example is the standard
ASCII code. One way of getting an encoding scheme that yields a shorter bit string on average is to assign
shorter codewords to more frequent symbols and longer ones to less frequent symbols. Such a variable-length
encoding scheme was used in the telegraph code invented by Samuel Morse. In Morse code, frequent letters
such as e (·) and a (· -) are assigned short sequences of dots and dashes while infrequent letters such as q (-

– · -) and z (- – ··) are assigned longer ones.
In this project you will implement a variable-length encoding and decoding scheme, and run experiments

to evaluate the effectiveness of the scheme, and the efficiency of your algorithms.

Note: This project must be completed individually. Your implementation must use C/C++ and
ultimately your code must run on Gradescope which will be configured to match the Linux machine
general.asu.edu available to everyone in this course.

You may not use any external libraries to implement any part of this project, aside from the standard
libraries for I/O, dynamic memory allocation, and string functions (stdio.h, stdlib.h, string.h, and their
equivalents in C++). If you are in doubt about what you may use, ask me.

By convention, your program should exit with a return value of zero to signal that all is well; non-zero
values are used to signal abnormal situations.

You must use a version control system as you develop your solution to this project, e.g., GitHub or
similar. Your code repository must be private to prevent anyone from plagiarizing your work.

The rest of this project description is organized as follows. §1 gives the requirements for Project #1
including a description of the encoding and decoding schemes to implement in §1.1 and §1.2, respectively.
§2 gives the experiments to design to evaluate the effectiveness and efficiency of the encoding and decoding
schemes. Finally, §3 describes the submission requirements for the milestone and full project deadlines.

1 Program Requirements for Project #1

1. Write a C/C++ program that implements the encoding scheme described in §1.1 on plain text. Using
a makefile, the program must be compiled into an executable named encode. It must read one
command line parameter, either insertion or merge. In all cases, you must read the text to encode
from stdin, allowing redirection from a text file. You must write the encoded plain text to stdout,
allowing redirection to a text file. See §1.1 for detailed instructions of the encoding algorithm.

2. Write a C/C++ program that implements the decoding scheme described in §1.2 on input in the format
produced by encoding scheme as prescribed in §1.1. Using a makefile, the program must be compiled
into an executable named decode. In all cases, you must read input from stdin allowing redirection
from an encoded text file. You must write the decoded input to stdout, allowing redirection to a text
file. The decoded input should match the original input, i.e., the input prior to encoding. See §1.2 for
detailed instructions of the decoding algorithm.

3. Design experiments to evaluate your programs as described in §2. A brief report with figures plotting
the data you collect, and interpretation of your results is expected.

Sample input will be provided on Canvas. You will use Gradescope to test the correctness of your
programs. Therefore, absolutely no changes to these project requirements are permitted.

1

http://www.asciitable.com/

1.1 The Encoding Algorithm

Given a text file (assuming Linux end-of-line), for each line of the file which may contain any ASCII character:

1. Transform the line into a form that is more amenable to compression. The transformation rearranges
the characters in the input into many clusters of repeated characters, in a way that makes it possible
to recover the original input.

2. Write the compressed form of the transformed line to stdout.

To transform a line of text, treat the line as a string of length N . First, compute (and store) every cyclic
shift of the string, shifting to the right by one character. Then sort the N cyclic shifts in co-lexicographic
order according to the ASCII code. Note the ordering of characters in the ASCII code. You should expect
lines of text to contain upper and lower case letters, numbers, spaces, and any other characters in the ASCII
character table (not in extended ASCII).

Table 1 shows the transformation step for the example string mississippi, of length N = 11. Under
Original, rows with index 0, . . . , 10 show the cyclic shift of the string to the right by as many characters, i.e.,
at index 5 is the string shifted cyclically to the right by 5 characters. Under Sorted, all N cyclic shifts of
the string are sorted in co-lexicographic order. One way to put strings in co-lexicographic order is to reverse
each string, sort them in lexicographic order, then reverse them again.

Table 1: Transformation Step of the Encoding Algorithm
Index Original Index Sorted

0 m i s s i s s i p p i 0 s s i s s i p p i m i
1 i m i s s i s s i p p 1 m i s s i s s i p p i
2 p i m i s s i s s i p 2 s s i p p i m i s s i
3 p p i m i s s i s s i 3 p p i m i s s i s s i
4 i p p i m i s s i s s 4 i s s i s s i p p i m
5 s i p p i m i s s i s 5 p i m i s s i s s i p
6 s s i p p i m i s s i 6 i m i s s i s s i p p
7 i s s i p p i m i s s 7 s i s s i p p i m i s
8 s i s s i p p i m i s 8 s i p p i m i s s i s
9 s s i s s i p p i m i 9 i s s i p p i m i s s
10 i s s i s s i p p i m 10 i p p i m i s s i s s

The compressed output for a non-empty line consists of two lines:

1. One line consisting of the index of the row in which the original string appears in the Sorted column.
2. A second line consisting of the encoding. To produce the encoding, form a string first consisting of

the first character of each Sorted string. In this string, which is some permutation of the original
string, characters form clusters of characters of size one or more. To encode first, step through the
string from left to right processing the clusters: For each cluster, output the cluster size, followed by
a space, followed by the character in the cluster. Finally, output the LF (\n) character.

If a line is empty, i.e., consists of a LF control character only, then the output is a LF character only.
For the example string, the original string appears at index position one in the Sorted column. The first

character of each Sorted string concatenated is a new string first=smspipissii. The first two characters
are each in a cluster of size one. In fact, all clusters except the last two have size one. Therefore, the encoding
of the string smspipissii is:

1

1 s 1 m 1 s 1 p 1 i 1 p 1 i 2 s 2 i

As is discussed in §2, the encoding for this string is not particularly good, i.e., the encoding does not
compress the string very much. The eleven characters in the input mississippi are compressed into nine.

2

http://www.asciitable.com/

1.1.1 Input to the Encoding Algorithm

The encoding algorithm has one input parameter taken from the command line: It is a keyword indicating
the sorting algorithm to use. The text to encode must be read from stdin, which may be redirected from
a file in Unix/Linux format. It is important to know that in Window files, the end of line is signified by two
characters, Carriage Return (CR) followed by Line Feed (LF) while in Linux, only LF is used. Similarly,
your output must be written to stdout, which may be redirected to a text file.

Some examples of input to your encode program:

encode insertion encoded_ex1.txt

encode merge encoded_ex2.txt

You must implement both the Insertion Sort and the Mergesort sorting algorithms. If the command
line parameter is insertion then you must use Insertion Sort whenever sorting is required in the encoding
algorithm. Likewise, if the command line parameter is merge then the sorting algorithm used is Mergesort.

The strings in Original are to be sorted in co-lexicographic order. If you do not sort them according to
this ordering your encoding will be incorrect.

Do not sort the strings directly because it will result in too much data movement. To be
efficient, sort pointers to the strings instead.

1.1.2 Output of the Encoding Algorithm

The output of the encoding scheme is text in which each line of the input is encoded as described in §1.1.
Two lines of output are produced for every non-empty line of input.

1.2 The Decoding Algorithm

Now we describe how to decode a line, i.e., recover the original line. There are three steps in the recovery:

1. Read the integer giving the index of the row in which the original string appears in the Sorted column.
2. Recover the string first; produce last.
3. Using first and last, compute the prev column, and use it together with the index to recover the

original string.

Of course, this process must be repeated for each encoded line of the file.
Recall that the encoded output of a line containing the string mississippi is:

1

1 s 1 m 1 s 1 p 1 i 1 p 1 i 2 s 2 i

From this encoding, the first character of each line of the Sorted array is recovered; assume this is also stored
in a character array (string) first=smspipissii. Now sort the characters in first lexicographically to
produce a string last=iiiimppssss, that corresponds to the last character of each row in the Sorted array;
see both Table 1 and Table 2. Without knowing the other characters between the first and the last in Table
2, the goal is to recover the original string.

For i = 0, . . . N − 1, define prev[i] as the index of the row containing the cyclic shift of Sorted[i] to
right by one character. What is amazing is that the information in the encoding is enough to construct prev
and, from that, the original string!

Consider the character m which only occurs once in the original string mississippi and the array Sorted
is formed from cyclic shifts. Specifically, m is in the last position in the array in row 4. Therefore, we can
deduce that prev[4] = 1 because shifting m to the right by one cyclically moves it into the first position, and
the only row with an m in the first column has index 1.

However all the other characters in mississippi occur multiple times (there are four i, two p, and four
s), so how can we tell how to compute prev in these cases? For character p, for example, it may seem
ambiguous whether prev[5] = 3 and prev[6] = 5, or prev[5] = 5 and prev[6] = 3.

3

Table 2: Reconstructing prev from the Encoding
Index Sorted prev

0 s ? ? ? ? ? ? ? ? ? i 4
1 m ? ? ? ? ? ? ? ? ? i 6
2 s ? ? ? ? ? ? ? ? ? i 9
3 p ? ? ? ? ? ? ? ? ? i 10
4 i ? ? ? ? ? ? ? ? ? m 1
5 p ? ? ? ? ? ? ? ? ? p 3
6 i ? ? ? ? ? ? ? ? ? p 5
7 s ? ? ? ? ? ? ? ? ? s 0
8 s ? ? ? ? ? ? ? ? ? s 2
9 i ? ? ? ? ? ? ? ? ? s 7
10 i ? ? ? ? ? ? ? ? ? s 8

As it turns out, there is a rule that resolves the ambiguity. It is: If row index i and j both end with the
same letter and i < j, then prev[i] < prev[j]. This rule implies that prev[i=5] = 3 and prev[j=6] = 5. Why is this rule valid? Recall that the rows are sorted co-lexicographically, so row 5 is less than row 6 in co-lex order. This means that the nine unknown characters in row 5 must be less than the nine unknown characters in row 6 (since both rows end with the letter p). We also know that between the two rows that end with p, row 3 is less than row 5 in co-lex order. But, the nine unknown characters in row 5 and 6 are precisely the last nine characters in rows 3 and 5, respectively. Thus, prev[5] = 3 and prev[6] = 5 or this would contradict the fact that the strings are sorted co-lexicographically. Using the rule allows all remaining ambiguities to be resolved and all entries of prev to be computed. Once prev has been computed, recovering the original string mississippi from the encoded string is straightfor- ward; it is recovered in reverse order. We are given the index=1 to begin recovery. Using index, first, and prev decoding is given by the following pseudocode (which, of course, you must generalize): index = 1; // Index of row in which original string appears int prev[11] = { 4, 6, 9, 10, 1, 3, 5, 0, 2, 7, 8 }; // Computed as just described char first[12] = "smspipissii", // Reconstruct first from encoding char original[12]; // Recover original string in reverse order original[ strlen(first) ] = ’\0’; // All strings are NULL terminated pos = prev[ index ]; for( i = strlen( first ) - 1; i >= 0; i– ){

original[ i ] = first[ pos ]; // Recover original in reverse order

pos = prev[ pos ];

}

This results in the output mississippi, i.e., the original string of the line is successfully recovered.

1.2.1 Input to the Decoding Algorithm

The input to the decoding scheme is text in the format generated by the encoding scheme. As with the
encoding scheme, the text to decode must be read from stdin, which may be redirected from a file in Linux
format. Similarly, your output must be written to stdout, which can be redirected to a text file.

If the input to the encoding scheme consists of k non-empty lines, then its output has 2k encoded lines.
Thus the decoding scheme iterates k times in order to decode the k encoded non-empty lines of input.

1.2.2 Output of the Decoding Algorithm

The output of the decoding should match the input to the encoding exactly. Note that some care will be
needed to process the LF characters correctly. (Try using the diff commands to compare the two files.)

4

2 Experimentation

A standard measure of the “goodness” of a compression algorithm’s effectiveness is the compression ratio.
This is the ratio t−c

t
× 100%, where t is the total number of characters in the input, and c is the number of

clusters in the encoding.
For example, the encoding of the word mississippi, the total number of characters is t = 11, and the

number of clusters is c = 9, and so the compression ratio is 2
11
× 100 or only 18%. (Here, we are ignoring the

fact we used integers to code the cluster sizes; this can be done more intelligently but this project is already
enough work, right? ,)

Design a set of experiments to study:

1. The average compression ratio; in addition to the average, compute the minimum, maximum, and
standard deviation of the compression ratio. You might consider using a box and whiskers plot for this
metric.

2. The time to encode each input for each type of sort, i.e., for Insertion and for Mergesort. Plot the run
time as a function of input size.

3. The time to decode each encoded input. Plot the run time as a function of input size.
4. The encoding algorithm has been described as encoding one line at a time. If you instead encode

2, 3, . . . lines at a time, does the compression ratio improve? What do you expect to happen and why?

3 Submission Instructions

Submissions are always due before 11:59pm on the deadline date. Do not expect the clock on your machine
to be synchronized with the one on Canvas or on Gradescope!

1. The milestone is due on Sunday, 09/12/2021. See §3.1 for requirements.
2. The complete project is due on Sunday, 09/26/2021. See §3.2 for requirements.

It is your responsibility to submit your project well before the time deadline!!! Late projects
are not accepted.

An unlimited number of submissions are allowed. The last submission will be graded.

3.1 Requirements for Milestone Deadline

For the milestone deadline you are to implement the encoding scheme described in §1.1 and you only need
to support the insertion command line parameter, i.e., all sorting is to be done using the Insertion Sort
algorithm. (You still need to read the parameter on the command line parameter.

Submission instructions for the milestone as well as a grading rubric will be posted on Canvas. A demo
of Gradescope will be provided in the first recitation. Sample input will be provided on Canvas.

The milestone is worth 30% of the total project grade.

3.2 Requirements for Complete Project Deadline

For the full project deadline, you are to implement both the encoding and decoding schemes, both sorting
algorithms. You are also required to conduct the experiments with both encode and decode described in §2,
and summarize your results, and interpretation of those results in a report. The report should be in PDF
format.

Submission instructions for the full project, as well as a grading rubric will be posted on Canvas.

5

Program Requirements for Project #1
The Encoding Algorithm
Input to the Encoding Algorithm
Output of the Encoding Algorithm

The Decoding Algorithm
Input to the Decoding Algorithm
Output of the Decoding Algorithm

Experimentation
Submission Instructions
Requirements for Milestone Deadline
Requirements for Complete Project Deadline