FIT3155 S2/2023: Assignment 2 (Due night 11:55pm on Fri 24 Sept 2023)
[Weight: 20 = 10 + 10 marks.]
Your assignment will be marked on the performance/efficiency of your program. You must write all the code yourself, and should not use any external library routines, except those that are considered standard.
Follow these procedures while submitting this assignment:
Copyright By PowCoder代写 加微信 powcoder
The assignment should be submitted online via moodle strictly as follows:
All your scripts MUST contain your name and student ID.
Use gzip or Winzip to bundle your work into an archive which uses your student ID as the file name.
– Your archive should extract to a directory which is your student ID.
– It should contain 2 subdirectories q1/ and q2/. Read each question’s specification
carefully and place your attempts in their corresponding subdirectories.
Submit your archive electronically via Moodle.
Strictly adhere to the specifications listed in each question.
Do NOT hard-code input filenames in your solutions. These should be passed from the command line.
Academic integrity, plagiarism and collusion
Monash University is committed to upholding high standards of honesty and academic integrity. As a Monash student your responsibilities include developing the knowl- edge and skills to avoid plagiarism and collusion. Read carefully the material available at https://www.monash.edu/students/academic/policies/academic-integrity to understand your responsibilities. As per FIT policy, all submissions will be scanned via MOSS or JPLAG.
Generative AI not allowed!
This unit fully restricts you from availing/using generative AI to answer these assessed questions.
Question 1: Linear time BWT generation [10 Marks]
In this exercise you are asked to write a linear time algorithm to generate a Burrows- (BWT) of any given input string str[1 . . . n]. Your implementation should be fully based on the algorithms taught/learnt in FIT3155.
Strictly follow the following specification to address this task:
Program name: genbwt.py
Argument to your program: A filename, whose contents are the input string str[1 . . . n].
It is safe to assume that there are no line breaks in the input file, and that all characters str[1…n] are lexicographically larger than the terminal character $, which your program will append to str[1…n] after reading the string from the input file.
Command line usage of your script:
python genbwt.py
Output format: The output file is another string corresponding to the BWT of the original input string str[1 . . . n]$.
Example: If str[1 . . . n]$ = woolloomooloo$, then the output BWT string would be: ooolooooolwml$
Question 2: Approximate Pattern matching under a Ham- ming distance threshold using BWT [10 Marks]
You have learnt how to handle the exact pattern matching problem using a BWT in this unit. This exercise will require you to generalize this searching to pattern matching at or below a specified threshold/limit of Hamming distance. Note, two equal-length strings have a Hamming distance of d, if and only if there are d opposing characters that differ between them.
Strictly follow the following specification to address this task:
Program name: hdbwtpm.py Arguments to your program:
1. Filename containing a preprocessed BWT string bwt[1 . . . n + 1].
The input is a preprocessed BWT of some original reference text txt[1 . . . n]$.
For instance, the output of Question 1 can be used as input in this question.
You can safely ignore all non-printable characters (e.g. new lines) if they exist in this input file. All other characters in this file are printable ASCII characters in the range [36, 126], and $ (ASCII 36) is the unique terminal character (originally appended to the original reference text before generating its BWT).
2. Filename containing a pattern pat[1 . . . m] (to search within the reference text using BWT as the search index).
You can safely ignore all non-printable characters (e.g. new lines at the end) if they exist in this input file.
You are also free to assume that the pattern strings have characters in the ASCII range of [37, 126] (i.e., will never contain a character, $).
3. The Hamming distance threshold max d (for approximate pattern matching in the text).
For this exercise, assume that max d ≤ 2. Also, you are free to assume that the input files will be such that 2 ≤ m ≤ n.
Command line usage of your script:
python hdbwtpm.py
Output format: The output file should report only the number of matches of the pattern within the text for increasing values of hamming distance d ∈ {0, 1 . . . max d} in the format specified below (no need to report the positions of (approximate) matches of the pattern in the original reference text1):
Example: If the original reference text were txt[1 . . . n]$ = woolloomooloo$, the first argu- ment to your program will be a file containing its BWT: bwt[1 . . . n+1] = ooolooooolwml$. If the pattern string were pat[1 . . . 5] = olloo and input value of max d was 2, then the output you should report in output hdbwtpm.txt would be:
d = 0, nMatches = 1
d = 1, nMatches = 1
d = 2, nMatches = 2
To understand the above output, consider the following annotated illustration of matches of the pattern (and its Hamming distance) to the length-m substrings of the reference text):
woolloomooloo
olloo d=5 ~~~~~
woolloomooloo
woolloomooloo
| woolloomooloo | woolloomooloo
| olloo d=3 | olloo d=3 |~~~ |~~~ ||
| woolloomooloo | woolloomooloo
| olloo d=4 | olloo d=4 |~~~~ |~~~~ ||
| woolloomooloo | woolloomooloo
| olloo d=2 | olloo d=1 | ~~ | ~
1Although not necessary for this assignment, note that you will additionally need to input the corresponding suffix array as input along with the BWT to be also able to report the exact positions in the reference text where the (approximate) pattern matches appear.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com