1 DNA fragments (9 marks)
You are maintaining a database for a study into drug resistance in hospitals. During the study, you will receive bacterial DNA collected from patients. Each bacterial DNA sample con- tains a drug resistant gene sequence at the start.
The researchers will need to query your database for various different drug resistant gene sequences over the course of the study (details below), but you will also need to be able to update the contents of the database with new bacterial DNA samples.
To solve this problem, you will need to create a class SeqeunceDatabase, to represent the database. This class will need to have two methods, addSequence(s) and query(q).
As usual, you are welcome to create other functions/methods.
Note: DNA is typically represented using A, T, C and G. For the purpose of easy coding, we will use A, B, C and D (since they have adjacent ASCII values).
1.1 Input
The input to addSequence is a single nonempty string of uppercase letters in uppercase [A-D]. The input to query is a single (possibly empty) string of uppercase letters in uppercase [A-D].
1.2 Output
addSequence(s) should not return anything. It should appropriately store s into the database represented by the instance of SequenceDatabase. We define the frequency of a particular string to be the number of times it has been added to the database in this way.
query(q) should return a string with the following properties:
• It must have q as a prefix
• It should have a higher frequency in the database than any other string with q as a
prefix
• If two or more strings with prefix q are tied for most frequent, return the lexicographically least of them
4
1.3 Example
db = SequenceDatabase()
db.addSequence(“ABCD”)
db.addSequence(“ABC”)
db.addSequence(“ABC”)
db.query(“A”)
>>> “ABC”
db.addSequence(“ABCD”)
db.query(“A”)
>>> “ABC”
db.addSequence(“ABCD”)
db.query(“A”)
>>> “ABCD”
Note that the name “db” in the above example is arbitrary (i.e. the instance of SequenceDatabase could be called anything)
1.4 Complexity
Remember that string comparison is not considered O(1) in this unit. • The __init__ method of SequenceDatabase should run in O(1) • addSequence(s) should run in O(len(s))
• query(q) should run in O(len(q))
5
2 Open reading frames (8 marks)
In Molecular Genetics, there is a notion of an Open Reading Frame (ORF). An ORF is a portion of DNA that is used as the blueprint for a protein. All ORFs start with a particular sequence, and end with a particular sequence.
In this task, we wish to find all sections of a genome which start with a given sequence of characters, and end with a (possibly) different given sequence of characters.
To solve this problem, you will need to create a class OrfFinder. This class will need a method, find(start, end). Also note that the constructor for this class takes a string genome as a parameter, unlike the class in Problem 1 (shown in the example below).
2.1 Input
genome is a single non-empty string consisting only of uppercase [A-D]. genome is passed as an arguement to the __init__ method of OrfFinder (i.e. it gets used when creating an instance of the class).
start and end are each a single non-empty string consisting of only uppercase [A-D]. 2.2 Output
find returns a list of strings. This list contains all the substrings of genome which have start as a prefix and end as a suffix. There is no particular requirement for the order of these strings.
2.3 Example
genome1 = OrfFinder(“AAABBBCCC”)
genome1.find(“AAA”,”BB”)
>>> [“AAABB”,”AAABBB”]
genome1.find(“BB”,”A”)
>>>[]
genome1.find(“AA”,”BC”)
>>>[“AABBC”,”AAABBBC”]
genome1.find(“A”,”B”)
>>> [“AAAB”,”AAABB”,”AAABBB”,”AAB”,”AABB”,”AABBB”,”AB”,”ABB”,”ABBB”]
2.4 Complexity
• The __init__ method of OrfFinder must run in O(N2) time, where N is the length of genome.
• find must run in (len(start) + len(end) + U ) time, where U is the number of characters in the output list.
As an example of what the complexity for find means, consider a string consisting of N “B”s
followed by N “A”s. “BBBB…AAAA…” 2
2
6
If we call find(“A”,”B”), the output is empty, so U is O(1). On the other hand, if we call find(“B”, “A”) then U is O(N2).
7
Warning
For all assignments in this unit, you may not use python dictionaries or sets. This is because the complexity requirements for the assignment are all deterministic worst case requirements, and dictionaries/sets are based on hash tables, for which it is difficult to determine the deter- ministic worst case behaviour.
Please ensure that you carefully check the complexity of each inbuilt python function and data structure that you use, as many of them make the complexities of your algorithms worse. Common examples which cause students to lose marks are list slicing, inserting or deleting elements in the middle or front of a list (linear time), using the in keyword to check for membership of an iterable (linear time), or building a string using repeated concatenation of characters. Note that use of these functions/techniques is not forbidden, however you should exercise care when using them.
These are just a few examples, so be careful. Remember, you are responsible for the complexity of every line of code you write!
8