COMP5112 / MSBD5009 Parallel Programming (Spring 2022)
Assignment 1: MPI Programming
Submission deadline: 23:59 (pm) on Mar. 16 (Wednesday), 2022
I. General Notes
Copyright By PowCoder代写 加微信 powcoder
1. This assignment counts for 15 points.
2. This is an individual assignment. You can discuss with others and search online resources, but
your submission must be your own code. Plagiarism checking will be enforced, and penalty will
be given to all students involved in an incident of academic dishonesty.
3. Add your name, student id, and email at the first two lines of comments in your submission.
4. Submit your assignment through Canvas before the deadline.
5. All submission will be compiled and tested uniformly on given machines of the course (CSE
Lab 2 machines for COMP5112 and Azure virtual machines for MSBD 5009).
6. Please direct any inquiries about this assignment to the designated TAs listed in the CANVAS
assignment page.
7. No late submission will be accepted!
II. Problem Description
This assignment is on the MPI parallelization of a super-mer generation algorithm given an input list of DNA fragments, called reads. Each read is represented as a string of length 𝑛, and the characters in the string include ‘A’, ‘T’, ‘C’, and ‘G’. A k-mer is a substring of length 𝒌 of a read, so a read of length 𝑛 contains 𝑛 − 𝑘 + 1 k-mers. A minimizer of length 𝒑 of a k-mer is the lexicographically smallest length-𝑝 substring of the k-mer. Finally, a super-mer is generated by merging consecutive k-mers in the read that have the same minimizer. Two k-mers are consecutive in a read if their starting positions in the read are consecutive. Figure 1 illustrates the super-mer generation algorithm (Algorithm 1) with a simple example.
Input:𝑘 = 9, 𝑝 = 5, 𝑛 = 14
𝑅𝑒𝑎𝑑 = CAAATTACTGCATA
Super-mer Generation (mimimizers are underlined)
𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 ←AAATT,𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ← 1,𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ← 0 no update on minimizer
𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 ← 𝑛𝑒𝑤_𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 ←AATTA, 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ← 2 generate current super-mer, 𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ← 2
i=0 (k-mer #1)
i=1 (k-mer #2)
i=2 (k-mer #3)
i=2 (super-mer #1) CAAATTACTG
super-mer #1 is made up of k-mer #1 and #2, minimizer = AAATT
i=3 (k-mer #4) ATTACTGCA 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 ← 𝑛𝑒𝑤_𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 ←ACTGC, 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ← 7
i=3 (super-mer #2) AATTACTGC generate current supermer, 𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ← 3 super-mer #2 is made up of k-mer #3 only, minimizer = AATTA
𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ≠ 𝑛 − 𝑘, generate the last super-mer super-mer #3 is made up of k-mer #4 #5 #6, minimizer = ACTGC
i=4 (k-mer #5)
i=5 (k-mer #6)
(super-mer #3)
ATTACTGCATA
Figure 1: Illustration of super-mer generation (The minimizers are underlined) 1/4
III. Sequential Algorithm
Algorithm 1: Super-mer Generation (sequential) Input: readstring 𝑺 = 𝑠0,𝑠1,𝑠2,…,𝑠𝑛−1
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.
k-mer length 𝒌, minimizer length 𝒑 Procedure GenerateSupermer (𝑺, 𝒌, 𝒑):
𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 = the minimum 𝒑-substring of 𝑺[𝟎, 𝒌 − 𝟏] 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 = the starting position of minimizer in 𝑺 𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 = 0
for𝑖from𝟏to𝑛−𝑘do:
if 𝑖 > 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 then:
𝑛𝑒𝑤_𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 =theminimum 𝒑-substringof 𝑆[𝑖,𝑖+𝑘−1] update 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠
if 𝑆[𝑖+𝑘−𝑝,𝑖+𝑘−1] ≤ 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 then:
𝑛𝑒𝑤_𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 = 𝑆[𝑖+𝑘−𝑝,𝑖+𝑘−1]
update 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 end if
if 𝑛𝑒𝑤_𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 ≠ 𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 then:
𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟 = 𝑛𝑒𝑤_𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑟
savecurrentsuper-mer: 𝑺[𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠,𝑖+𝑘−1] 𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 = 𝑖
end if end for
if 𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠 ≠ 𝑛 − 𝑘 then:
save the last super-mer: 𝑺[𝑠𝑢𝑝𝑒𝑟𝑚𝑒𝑟_𝑏𝑒𝑔_𝑝𝑜𝑠, 𝑛]
end procedure
IV. Your Implementation Task
The code skeleton “gensupermer_mpi.cpp” is provided. In this assignment, your task is to
complete the missing code in the specified section in the main function:
1. Scatter the read data to each MPI processes.
2. Perform the super-mer generation in each process (you can call the provided read2supermers).
3. Gather all the super-mers to Process 0, and store them in a vector of strings (variable
“all_supermers” in which each super-mer is a string). The order of the super-mers in the vector is unimportant.
1. CSR Format: The CSR (compressed sparse row) format stores all rows in a single data array
and has an offset array to point to the beginning position of each row in the data array. For example, we store all reads in the array 𝑟𝑒𝑎𝑑_𝐶𝑆𝑅, and the index of each read in the array 𝑟𝑒𝑎𝑑_𝐶𝑆𝑅_𝑜𝑓𝑓𝑠𝑒𝑡. Figure 2 shows an example of four reads AC, ACT, AG, and GCT stored in CSR.
Figure 2: An Example of CSR Format
2. We provide the function: read2supermers. The function takes one read as input, generates all super-mers in the read, and stores them in a vector
3. For more information on MPI APIs, you may refer to https://www.open-mpi.org/doc/v4.0/.
V. The Given Package
gensupermer _mpi.cpp gensupermer _sequential.cpp dataset/*.txt result/ utilities.hpp
The MPI program skeleton for you to fill in the missing code in the designated place.
The sequential version of super-mer generation. It is for your reference and result comparison.
Each text file is a test dataset with each line containing a read.
An empty folder for super-mer output.
A header file containing utility functions such as file loading, result output, and result comparison.
VI. Compile and Run
1. We will not apply any compiler optimization during the assessment (e.g., -O2 -O3 …).
2. No external library or header file is allowed (e.g., Boost or other third-party libraries).
An example of input file and corresponding output file:
The input file contains multiple lines of reads. The following is an example containing 5 reads.
Example input: dataset1.txt
The output will be the super-mers generated from the reads. The order of the super-mers in the output file is lexicographic. For example, the 7th and the last super-mers in the example output file are generated from the first read in the example input file. The 7th super-mer contains 7 consecutive k-mers. These 7 k-mers share the same minimizer AAAACTG.
GATAACGAGTTCTAGAAAACTGGGTCCC
AGATCAGCAAACCCTGAGAAAAA AGAAAAATTAGCAATAATTAGCAGTGTTCATAACA AGAAGACAACTGGGCCCGGGGGAC GAGGGAGAACCTGATTTCCAGAGT
AAAATTAGCAATAATTAGCAG AAATTAGCAATAATTAGCAGT AATTAGCAATAATTAGCAGTGTTCATAA AGAAAAATTAGCAATAATTAGCA AGAAGACAACTGGGCCCGGGGGAC AGATCAGCAAACCCTGAGAAAAA ATAACGAGTTCTAGAAAACTGGGTCCC ATAATTAGCAGTGTTCATAACA GAGGGAGAACCTGATTTCCAGAGT GATAACGAGTTCTAGAAAACT
Example output (k=21, p=7): my_gs_output.txt
Example of compiling and running the sequential program:
# Compile:
g++ gensupermer_sequential.cpp -o seq_gs -std=c++11
# Run and save super-mers to text file:
# (e.g., k=21, p=7, data file is ./dataset/dataset1.txt, and result file is ./result/seq_gs_output.txt):
./seq_gs 21 7 ./dataset/dataset1.txt ./result/
# Run without saving super-mers to text file:
./seq_gs 21 7 ./dataset/dataset1.txt
Example of compiling and running your MPI program:
# Compile:
mpic++ -std=c++11 gensupermer_mpi.cpp -o gs
mpiexec -n
#
VII. Submission and Evaluation
You only need to submit your completed gensupemer_mpi.cpp to Canvas before the deadline. You can add your code only in the specified section. Your program should not have any extra output to the result file or the standard output.
We will evaluate your program in different parameter settings. We will vary 𝑝 and 𝑘 (10 ≤ 2𝑝 ≤ 𝑘 < 23). The length of each read is greater than 22. Both the total length of all reads and that of the generated super-mers will be less than INT_MAX (0x7FFFFFFF). The number of MPI processes will be set to values less than 16. We will prepare multiple sets of evaluation arguments and datasets. Your grade will be marked based on both correctness and best speedup achieved.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com