Programming assignment 2
Released on Friday 2021-10-01 00:00:00 +0000
Due Date: 2021-10-16 00:01:00 +0000 00:01
Download
CMSC 423 Project 2 : Overview
This assignment deals with the construction and querying of the su�x array.
In order to make the construction of the su�x array generally useful, it will be
necessary to not only build this index, but also to be able to save it to disk
and load it from disk. To this end, this project will also require you to be able
to serialize and deserialize your index. Your project will consist of 3
executables buildsa , inspectsa , and querysa which are described in more
detail below. Though you will implement 3 programs, you can think of the
project as being broken into two parts.
In the �rst part of the project, you will implement a program to read a
reference sequence from a FASTA �le, to construct the su�x array on this
sequence, and then to write the string and su�x array to �le in a binary
format. You will also implement a program to read the saved �le from disk,
compute some statistics about the su�x array, and write those to a text
format output �le.
In the second part of the project, you will implement a program to read your
serialized su�x array from �le, as well as to read an input FASTA �le
containing many queries. Your program will then produce an output �le with
the query results in a well-speci�ed output format.
University of Maryland
Bioinformatics Algorithms, Databases, and Tools
Fall 2021
HOME LECTURES ASSIGNMENTS
SYLLABUS / COURSE MATERIALS
http://www.cs.umd.edu/
https://rob-p.github.io/CMSC423_F21/
https://rob-p.github.io/CMSC423_F21/
https://rob-p.github.io/CMSC423_F21/lectures/
https://rob-p.github.io/CMSC423_F21/assignments/
https://rob-p.github.io/CMSC423_F21/course-materials/
Note: As was mentioned at the start of the class, the projects will generally
build o� of each other. Project 1 was a bit of an outlier in this regard. The
only capability from project 1 that you will need is the ability to read FASTA
�les, and the ability to make executables using a build.sh script. However,
more overlap will be present between project 2 and future projects, so you
should attempt to build as clean and useable of an interface for your code in
this project as you can — you will likely be using the su�x array as a library in
future projects.
Overall structure
You will submit your assignment as a tarball named CMSC423_F21_A2.tar.gz .
When this tarball is expanded, it should create a single folder named
CMSC423_F21_A2 . This folder must be created in the directory where the
decompression (i.e. tar xzvf ) is done, and must not be nested inside any
other folders. The details of how you structure your “source tree” are up to
you, but the following must hold (to enable proper automated testing of your
programs).
There should be a script at the top-level of CMSC423_F21_A2 called
build.sh . This should do whatever is necessary to create 3 executables at
the top level (one called buildsa and one called inspectsa and one called
querysa ). If you’re comfortable with Make�les, this can just call make , or it
could simply run the commands necessary to compile your programs and
copy them to the top-level directory. You can assume this script is run in a
bash shell.
There should be a README.md �le in the top level directory. This README
�le should contain the following information.
What language have you written your solution in?
What did you �nd to be the hardest part of this assignment?
What resources did you consult in working on this assignment (view
this as a form of citation; you shouldn’t copy code directly from
anywhere in your assignment, but if you consulted other sources
please list them here).
Turnin : The assignment turnin will be handled using Gradescope. We intend
to have the infrastructure for this set up by the end of this week, so please
–
–
–
–
–
check back here for detailed instructions on the submission procedure.
Part (a), constructing and inspecting the suffix array
In this part of the assignment, you will write two programs. The �rst program
will be called buildsa ; it will read in a “genome” (in FASTA ) format, build the
su�x array on this reference, and write the string and su�x array to a binary
�le. The second program will be called inspectsa ; it will read in the binary �le
written by buildsa , and then it will compute some statistics (details below)
about the su�x array and write out a test �le containing these statistics and
the su�x array.
buildsa
buildsa : Input
The input consists of 2 arguments, given in this order:
reference – the path to a FASTA format �le containing the reference of
which you will build the su�x array.
output – the program will write a single binary output �le to a �le with this
name, that contains a serialized version of the input string and the su�x
array.
buildsa : Output
Your program will output a �le with the name given by the output argument
above. This must be a binary �le holding everything necessary to perform
query using your su�x array. Speci�cally, it should include the input string
itself (probably with the sentinel $ appended) and it should also include an
encoding of the entries of the su�x array. Note: The speci�c binary encoding
is up to you — it can be as simple as an integer representing the string length
followed by the bytes of the string and an integer representing the number of
bytes in the su�x array followed by the entries, or something more complex.
You are allowed to use an external serialization library for this component,
but the serialization must be to a binary (not text) format. In C++ you could
use something like cereal or bitsery; in Rust you could use soemthing like rkyv
or serde with bincode. For reasons mentioned previously (and for the sake of
not running into performance issues in future projects), I’d not recommend
–
–
https://uscilab.github.io/cereal/
https://github.com/fraillt/bitsery
https://github.com/rkyv/rkyv
https://github.com/serde-rs/serde
https://github.com/bincode-org/bincode
using Python for this project, but if you do, you can just use the builtin
pickle and cpickle modules.
NOTE : The timeout for all tests for the gradescope server will be 30 minutes.
This will remain the case in future projects as the amount of processing you
have to do generally increases. Thus, we highly recommend that you
implement either the O(m lg m) su�x array construction algorithm from the
original su�x array paper, the O(m) DC3 algorithm we covered in class, or
another e�cient (i.e. not O( lg m)) algorithm.
inspectsa
inspectsa : Input
The input consists of 3 arguments, given in this order:
index – the path to the binary �le containing your serialized su�x array
(as written by buildsa above).
sample_rate – the rate at which su�x array entries will be sampled in your
output (see below).
output – the program will write text output to this �le containing some
basic statistics about the su�x array, as well as a text representation of
the su�x array.
inspectsa : Output
Your program will output a �le with the name given by the output argument
above. The output will consist of text �le with exactly 4 lines. Unlike project 1,
there are no keys and values; rather, the values must be written out in
precisely the order speci�ed here. Your output should contain:
mean LCP1 value : This is a �oating point number representing the average
length of the longest common pre�x shared between subsequent entries
of the su�x array.
median LCP1 value : This is a �oating point number representing the
median length of the longest common pre�x shared between subsequent
entries of the su�x array.
maximum LCP1 value : This is an integer number representing the maximum
length of the longest common pre�x shared between subsequent entries
m
2
–
–
–
–
–
–
of the su�x array.
suffix array spot check : This is a textual representation of certain values
within the su�x array. Speci�cally, this line should encode a list of tab-
separated integers that consists of the following entries of the su�x
array (i.e. the entries in the su�x array appearing at the following indices)
[ sample_rate 0, sample_rate 1, sample_rate 2, …, sample_rate �oor(Su�x
Array Length / sample_rate )].
Part (b), querying the suffix array
In the second part of the assignment, you will implement
Your program for this part should be called querysa . This program will take
as input 4 arguments, your serialized su�x array, a FASTA �le containing
queries, a query mode parameter, and an output �le name. It will perform
query in the SA and report the results in the output �le speci�ed in the
format speci�ed below.
querysa : Input
index – the path to the binary �le containing your serialized su�x array
(as written by buildsa above).
queries – the path to an input �le in FASTA format containing a set of
records. Unlike project 1 you will need to care about both the name and
sequence of these fasta records, as you will report the output using the
name that appears for a record. Note, query sequences can span more
than one line (headers will always be on one line).
query mode – this argument should be one of two strings; either naive or
simpaccel . If the string is naive you should perform your queries using
the naive binary search algorithm. If the string is simpaccel you should
perform your queries using the “simple accelerant” algorithm we covered
in class.
output – the name to use for the resulting output.
querysa : Output
output – the output �le of your program. This �le should contain the
results of your queries in the following format. Each line should contain a
tab-separated list containing the following information:
–
–
–
–
–
–
University of
Maryland
Department of Computer Science
University of Maryland
College Park, MD
query_name , k , hit_1 , hit_2 , hit_k
Here, the query_name is simply the header of the corresponding FASTA entry
(the string after the > — not including the > on the header line). The value
k is the number of occurrences of the query string in the underlying text on
which the su�x array is built. Finally hit_1 through hit_k are the positions
in the original text (0-indexed) where the query string occurs. If a query string
does not occur in the text, then you should report k = 0, and there will be
no hit_1 , … etc. entries for that query.
–