ECE2035 Project One: Bioinformatics: DNA Search
DNA Search: This project explores pattern matching techniques to find a pattern in a DNA sequence containing letters in the DNA alphabet {A, C, G, T}. For example, suppose we have
a DNA sequence as follows:
ATGACGATCTACGTATGGCAGCCACGCTTTTGATGTTAAGTCACACAGCCAAGTCAACAAGGGC
GACTTCATGATCTTTCCGCTCCGTTGGTGTAGGCCCGTGTTCAAATTCAATGGCTGATTGGAAT
TACCTTTGAAATACTCCAACCGACCGCCACGGCCAGGGTCCCGCTCGCTCTCTGTGGCCCTCCC
ACAAAACTCCGGTGAAAGTTGATTTGGACACGGACCCAAAGCAGCGTAGATTATTCGAGCGTAT
TCGGTAGTCATTGAGGCCCCAA
The pattern “GCTTTT” can be found at index 27 (where the first letter of the sequence is at index 0). Note that overlapping matches are treated individually. For example, if the sequence is ‘AAAAAA’ and the pattern is ‘AAA’, there are 4 occurrences of the pattern: at indices 0, 1, 2, and 3. You will write a C program and a MIPS program to find the indices at which a given pattern occurs in a given DNA sequence.
Strategy: Unlike many “function only” programming tasks where a solution can be quickly envisioned and implemented, this task requires a different strategy.
1. Before writing any code, reflect on the task requirements and constraints. Mentally explore different approaches and algorithms, considering their potential performance and costs. The metrics of merit are static code length, dynamic execution time, and storage requirements. There are often trade offs between these parameters.
2. Once a promising approach is chosen, a high level language (HLL) implementation (e.g., in C) can deepen its understanding. The HLL implementation is more flexible and convenient for exploring the solution space and should be written before constructing the assembly version where design changes are more costly and difficult to make. For P1-1, you will write a C implementation of the program.
3. Once a working C version is created, it’s time to “be the compiler” and see how it translates to MIPS assembly. This is an opportunity to see how HLL constructs are supported on a machine platform (at the ISA level). This level requires the greatest programming effort; but it also uncovers many new opportunities to increase performance and efficiency. You will write the assembly version for P1-2.
P1-1 High Level Language Implementation:
In this section, the first two steps described above are completed. It’s fine to start with a simple implementation that produces an answer; this will help deepen your understanding. Then experiment with your best ideas for a better performing solution. Each hour spent exploring here will cut many hours from the assembly level coding exercise.
You should use the simple shell C program that is provided P1-1-shell.c which auto-generates both a new pattern and a new sequence each time it is run. Rename the shell file to P1-1.c and modify it by adding your code.
The provided shell program randomly initializes a text string (10240 characters) with the DNA alphabet. A pattern of 3 to 7 characters is also randomly generated using the DNA alphabet. You must insert code into the shell to implement the function Match which takes four input parameters: the pointer to the pattern text string, the length of the pattern text string, a
pointer to the DNA sequence text string, and the length of the DNA sequence. The Match function must compute the indices of the pattern occurrences and store them in order in the array
1
ECE2035 Project One: Bioinformatics: DNA Search
MatchIndices (a globally defined array). It also stores -1 at MatchIndices[n], where n is the total number of occurrences of the pattern. The -1 indicates the end of the sequence of indices. For example, if the sequence is ‘AACAAC’ and the pattern is ‘AAC’, the array MatchIndices[n]= {0, 3, -1}, where 0 indicates the index of the first occurrence of ‘AAC’ and 3 indicates the index of the second occurrence.
The shell C program includes important print statements used for grading (please don’t change them). You can modify any part of this program. Just be sure that your completed assignment correctly completes the print statements since this is how you will receive points for this part of the project. If you would like to add more print statements as you debug your code, wrap them in an if statement using the DEBUG flag – an example is given in the shell program – so that you can suppress printing them in the code you submit by setting DEBUG to 0. If your submitted code print extraneous output, it will be marked incorrect by the autograder.
Note: you will not be graded for your C implementation’s performance. Only its accuracy and “good programming style” will be considered (e.g., organizing and documenting your code). Your C implementation does not need to use the same algorithm as the assembly program; although it’s much easier for you if it does.
When have completed the assignment, submit the single file P1-1.c to Canvas. You do not need to submit data files. Although it is good practice to employ a header file (e.g., P1-1.h) for dec-
larations, external variables, etc., in this project you should just include this information at the beginning of your submitted program file. In order for your solution to be properly received and graded, there are a few requirements.
1. The file must be named P1-1.c.
2. Your name and the date should be included in the header comment.
3. Your submitted file should contain the unmodified print statements, giving the input pat-
tern, sequence, and identifying the starting locations of the pattern in the sequence.
4. Your program must compile and run with gcc under Linux. Compiler warnings will
cause point deductions. If your program does not compile or enters an infinite loop, it
will earn 0 correctness points.
5. Your solution must be properly uploaded to Canvas before the scheduled due date. The
canvas p1-1 assignment has details on late penalties and policies.
P1-2 Assembly Level Implementation: In this part of the project, you will write the performance- focused assembly program that solves the DNA search problem. A shell program (P1-2- shell.asm) is provided to get you started. Rename it to P1-2.asm.You will call a software interrupt that will generate a random test DNA sequence and a random pattern to search for. Your MIPS assembly code must compute the indices of each occurrence of the pattern in the DNA sequence. As with P1-1, this creates an array of integers, terminated with the integer -1.
Library Routines: There are three library routines (accessible via the swi instruction).
SWI 548: Create DNA Sequence and Pattern: This routine initializes memory beginning at the specified base address, given in register $1, with the randomly generated DNA sequence. In particular, the DNA sequence has a fixed length of 4800 characters from the DNA alphabet. The 4800 characters are encoded into a 600-word array starting at the memory address given in $1. Each character is represented by 2 bits, packed in little endian order in the lower 16 bits of each
2
ECE2035 Project One: Bioinformatics: DNA Search
of the 600 words. The upper 16 bits of all the 600 words are not used. Therefore, bits 0-1 of the first word denotes the first character of the text; bits 2-3 of the same word represents the second character; bits 0-1 of the second word represents the 9th character of the text.
The alphabet is encoded as follows: A: 00
T: 01 G: 10 C: 11
For example, if we have binary number 0000 0000 0000 0000 0000 0000 1110 0100, it represents ATGCAAAA .
Your MIPS code must decode the input word array and unpack the 4800 characters. The pattern is returned in $2 by swi 548. The pattern itself is packed in the lower half word of $2, the length of the pattern (between 3 and 7) is encoded in the upper half word of $2.
In addition, swi 548 will display the DNA sequence as a 80×60 matrix.
INPUTS: $1 should contain the base address of the 600 word (2400 byte) 80×60 array already allocated
in memory. OUTPUTS: Register $2 contains a randomly generated pattern in its lower
half word and the length of the pattern (between 3 and 7) in the upper half word.
SWI 549: Highlight Position: This routine allows you to specify a position in the 80×60 DNA sequence array and it marks the position in the DNA sequence visualization. INPUTS: $1 should contain the position (i.e., the offset into the DNA sequence) to be highlighted. OUTPUTS: none.
This is to be used to help you debug. You should comment out any swi 549 instructions before submitting your code since they will add to your static and dynamic instruction counts.
SWI 580: Matches: This routine allows you to report the indices of occurrences of the
pattern your program found in the DNA sequence. INPUTS: Register $2 must contain the base
address of the array storing the indices of the occurrences of the pattern found in the DNA
sequence. The array must end with -1. OUTPUTS: Register $1 holds the value 1 if the array holds the correct indices and it holds 0 otherwise.
Evaluation: In this version (P1-2), execution performance and cost are both important. The assessment of your submission will include functional accuracy during 100 trials and performance and efficiency. The code size, dynamic execution length, and operand storage requirements are scored empirically, relative to a baseline solution. The baseline numbers for this project are static code size: 52 instructions, dynamic instruction length: 40024 instructions (avg.), total register and memory storage required: 697 words (not including dedicated registers $0, $31). The dynamic instruction length metric is the maximum of the baseline metric and the average dynamic instruction length of the five fastest student submissions.
Your score will be determined through the following equation: PercentCredit = 2 − MetricYour Pr ogram
MetricBaseline Pr ogram
Percent Credit is then used to determine the number of points for the corresponding points
category. Important note: while the total score for each part can exceed 100%, especially bad 3
ECE2035 Project One: Bioinformatics: DNA Search
performance can earn negative credit. The sum of the combined performance metrics scores (code size, execution length, and storage) will not be less than zero points. Finally, the performance scores will be reduced by 10% for each incorrect trial (out of 100 trials). You cannot earn performance credit if your implementation fails ten or more of the 100 trials.
In MIPS assembly language, small changes to the implementation can have a large impact on overall execution performance. Often tradeoffs arise between static code size, dynamic execution length, and operand storage requirements. Creative approaches and a thorough understanding of the algorithm and programming mechanisms will often yield impressive results. Almost all engineering problems require multidimensional evaluation of alternative design approaches. Knowledge and creativity are critical to effective problem solving.
In order for your solution to be properly received and graded, there are a few requirements.
1. The file must be named P1-2.asm.
2. Your program must call SWI 580 to report an answer and return to the operating system via the jr instruction. Programs that include infinite loops or produce simulator warnings or errors will receive zero credit.
3. Your solution must be properly uploaded to Canvas before the scheduled due date.
Project Grading: The project grade will be determined as follows:
part
description
percent
P1-1
DNA Search (C code)
25
P1-2
DNA Search (MIPS assembly)
correct operation, proper commenting & style
25
static code size
15
dynamic execution length
25
operand storage requirements
10
total
100
All code (MIPS and C) must be documented for full credit.
Honor Policy: In all programming assignments, you should design, implement, and test your own code. Any submitted assignment containing non-shell code that is not fully created and debugged by the student constitutes academic misconduct. You should not share code, debug code, or discuss its performance with anyone. Once you begin implementing your solution, you must work alone.
Good luck and happy coding!
4