CS代考 COMP9319 2022T2 Assignment 2: Compressed BWT Backward Search

COMP9319 2022T2 Assignment 2: Compressed BWT Backward Search
Your task in this assignment is to create a search program that implements BWT backward search, which can efficiently search a run-length encoded and BWT transformed (RLB) record file without decoding the file back to a larger form. The original record file as plain text file (before BWT) format is:
[][][]… …
where , , , etc. are integer values that are used as unique record identifiers (increasing, not necessarily consecutive, positive integers);

and , , , etc. are text values, which include any visible ASCII alphabets (i.e., any character with ASCII value from 32 to 126 inclusively), tab (ASCII 9) and newline (ASCII 10 and 13). For simplicity, there will be no open or close square bracket in the text values.

After the above text file is BWT transformed, it will be run-length encoded to save storage space. Since the BWT file only includes ASCII characters up to ASCII value 126. The first bit of a byte is used to determine if a byte represents an ASCII character, or if it is part of a number that represents the count. That is, to differentiate a count from a character, a count will have the first bit on, and use the remaining 7 bits to represent the actual value. When a number is larger than 2^7, you will need to consider any subsequent “count” byte(s) to form the final value. Furthermore, to minimize the size of a count, 3 runs are represented as 0 (that is, 3 is the minimum number used to represent runs using count). An example in the next section (under ‘Run-length Examples’) will illustrate this run-length encoding scheme in detail.

Your C/C++ program, called bwtsearch, accepts a command argument of either -n for the number of matching substrings (count duplicates) or -a for listing the identifiers of all the matching records (no duplicates and in ascending order); the path to a RLB encoded file; the path to an index file; and a quoted query string (i.e., the search term) as commandline input arguments. The search term can be up to 512 characters. To make the assignment easier, we assume that the search is case sensitive.

If -a is specified, using the given query string, bwtsearch will perform backward search on the given RLB encoded file, and output the sorted and unique identifiers (no duplicates) of all the records that contain the input query string to the standard output. Each identifier is enclosed in a pair of square brackets, one line (ending with a ‘\n’) for each match. If -n is specified, given a query string, bwtsearch will output the total number of matching substrings (count duplicates) to the standard output. The output is the total number, with an ending newline character.

Your solution is allowed to write out one external index file that is no larger than the size of the given, input RLB file. If your index file is larger than the size of the input RLB file, you will receive zero points for the tests that generating/using that file. You may assume that the index file will not be deleted during all the tests for a given RLB file, and all the test RLB files are uniquely named. Therefore, to save time, you only need to generate the index file when it does not exist yet.
Usage Examples
Suppose the original file (dummy.txt) before BWT is:
[3]Computers in industry[25]Data compression[33]Integration[40]Big data indexing

Some examples:
%grieg> bwtsearch -n ~cs9319/a2/dummy.rlb ~/dummy.idx “in”
%grieg> bwtsearch -a ~cs9319/a2/dummy.rlb ~/dummy.idx “in”
%grieg> bwtsearch -n ~cs9319/a2/dummy.rlb dummy.idx “in ”
%grieg> bwtsearch -a ~cs9319/a2/dummy.rlb dummy.idx “In”
You can find the above dummy.rlb file plus some sample files by logging into CSE machines and going to folder ~cs9319/a2. Note that the original text and BWT files (i.e., .txt and .bwt files) are provided there for your reference only. Your solution should not assume the availability of these files during the assignment marking. You will only be provided with RLB encoded files (.rlb files).
Testing Examples
To assist you to test the correctness of your program easily, two script programs (called count and fetch) have been made and are available in ~cs9319/a2. Instead of taking a RLB file as input, they take the original text file as input and then produce the output that a correct assignment solution is expected to produce. During marking, we will use the diff command to determine if your solution output is different from the expected output, as shown in the examples below. Note that the script programs are memory based and may not scale to 10MB+ testcases, but they will still help you to establish your solution correctness by starting with files of small or medium size.

~cs9319/a2/count
Usage: /import/kamen/A/cs9319/a2/count FILE.txt PATTERN
~cs9319/a2/count ~cs9319/a2/dummy.txt “in”
~cs9319/a2/fetch ~cs9319/a2/dummy.txt “in”
~cs9319/a2/count ~cs9319/a2/dummy.txt “in” > 1.ans
~cs9319/a2/fetch ~cs9319/a2/dummy.txt “in” > 2.ans

bwtsearch -n ~cs9319/a2/dummy.rlb ~/dummy.idx “in” > 1.output
bwtsearch -a ~cs9319/a2/dummy.rlb ~/dummy.idx “in” > 2.output

diff 1.ans 1.output
diff 2.ans 2.output

~cs9319/a2/count ~cs9319/a2/medium.txt “data”

Run-length Examples
The folder ~cs9319/a2 also includes some examples to illustrate the run-length encoding scheme used in this assignment.
cd ~cs9319/a2
more abcde.bwt
aAAbbbBBBBcccccCCCCCCdDDeeeEEEE

xxd -b abcde.rlb
00000000: 01100001 01000001 01000001 01100010 10000000 01000010 aAAb.B
00000006: 10000001 01100011 10000010 01000011 10000011 01100100 .c.C.d
0000000c: 01000100 01000100 01100101 10000000 01000101 10000001 DDe.E.

As shown in the above xxd output, the first bit of a byte is used to determine if a byte represents an ASCII character or a count, i.e., a count will have the first bit on. Therefore, the above xxd output can be summarized as:
aAAb{0}B{1}c{2}C{3}dDDe{0}E{1}
where the value inside a pair of brackets represents its corresponding count observed from the xxd output. As mentioned earlier, 3 is the minimum number used to represent runs. You can observe from the above example why this will lead to the minimum size of a count.

Another example is to illustrate multi-byte counts, i.e., when a count is larger than 2^7. Consider two BWT files a150 and a20k (contain 150 a’s and 20,000 a’s, respectively) in ~cs9319/a2
xxd -b a150.rlb
00000000: 01100001 10010011 10000001 a..
xxd -b a20k.rlb
00000000: 01100001 10011101 10011100 10000001 a…

Since 3 (3 runs) is the minimum number and is represented as 0 count, 150 is represented as 147 (in binary 10010011). As shown above, a large count will be divided into 7-bit blocks by starting from the least significant bit. Therefore, the last 7 bits representing 147 (10010011) will be captured by the first byte of the count (10010011), and the remaining bit representing 147 (10010011) will be captured in the subsequent byte (10000001).

Similarly, 20,000 is represented as 19,997 (in binary 1 0011100 0011101) and it will be divided into 3 bytes accordingly: 10011101 10011100 10000001.

1. It does not matter if your program needs to be executed as ./bwtsearch instead of bwtsearch.
2. To avoid large runtime memory for sorting, none of the testcases for marking will result in more than 5,000 matches when -a is specified
3. The input filename is a path to the given RLB encoded file. Please open the file as read-only in case you do not have the write permission (e.g., those files located in ~cs9319/a2)
4. Marks will be deducted for output of any extra text, other than the required, correct answers (in the right order). This extra information includes (but not limited to) debugging messages, line numbers and so on.
5. You can assume that the input query string will not be an empty string (i.e., “”). Furthermore, search terms containing only numbers, search terms containing any square bracket, search terms containing dash dash (i.e., “–“), or search terms resulting in no matches will not be tested.
6. You may assume that offset >= 0 and will fit in an unsigned int.
7. When counting the number of substring matches (i.e., with -n option), to make it easier for your implementation of backward search matching, all combinations of matches should be counted, e.g., 2 matches of “ana” on “banana”.
8. You are allowed to use one external index file to enhance the performance of your solution. However, if you believe that your solution is fast enough without using index file, you do not have to generate the file. Even in such a case, your solution should still accept a path to the index file as one of the input arguments as specified.
9. A record (in the original record/text file before BWT) will not be unreasonably long, e.g., you will not see a text value that is 5,000+ chars long.
10. Empty records may exist in the original files (before BWT). However, these records will never be matched during searching because the empty string will not be used as a search term when testing your program.
11. Your source code may be inspected. Marks may be deducted if your code is very poor on readability and ease of understanding.

Marking & Performance
This assignment is worth 35 points, all based on auto marking.

We will use the make command below to compile your solution. Please provide a makefile and ensure that the code you submit can be compiled on grieg, a CSE Linux machine. Solutions that have compilation errors will receive zero points for the entire assignment.
Your solution should not write out any external files other than the index file. Any solution that writes out external files other than the index file will receive zero points for the entire assignment.

Your solution will be compiled and tested on grieg.

In addition to the output correctness, your solution will also be marked based on space and runtime performance. Your soluton will not be tested against any RLB encoded files generated from source text files that are larger than 160MB.

Runtime memory is assumed to be always less than 16MB. Any solution that violates this memory requirement will receive zero points for that query test. Runtime memory consumption will be measured by valgrind massif with the option –pages-as-heap=yes, i.e., all the memory used by your program will be measured. Your code may be manually inspected to check if memory is allocated in a way that avoids the valgrind detection and exceeds the above limit.

Any solution that runs for more than 60 seconds on grieg for the first query on a given RLB file will be killed, and will receive zero points for the queries for that RLB file. After that, any solution that runs for more than 10 seconds for any one of the subsequent queries on that RLB file will be killed, and will receive zero points for that query test. We will use the time command (i.e., /usr/bin/time) and take the sum of the user and system time as runtime measurement.

Bonus marks (up to 3.5 points, i.e., 10 percent) will be awarded for the solution that achieves 35 points (i.e., full marks) and runs the fastest overall (i.e., the shortest total time to finish all the tests). Note: regardless of the bonus marks you receive in this assignment, the maximum final mark for the subject is capped at 100.

程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts