The genetic code of every living organism is a long sequence of simple molecules called nucleotides, or bases, which make up its DNA. There are four nucleotides: A, C, G, and T. The genetic code of a human is a string of 3.2 billion letters drawn from A, C, G, and T.
In this problem we search for a substring of length k that occurs most frequently in the human genome.
DNA can be written as a string of the letters ‘a’, ‘g’, ‘c’, and ‘t’, e.g.,
atcaatgatcaacgtaagcttctaagcatg
atcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttg
tatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttct
tggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccata
ttgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgt
ttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaa
gccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacga
tttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgac
tcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaat
gatcaagctgctgctcttgatcatcgtttc
1- Write a function that takes a long string of letters and creates a list of all possible k-letter sequences. For example, the string ‘gcacttgcatgcac’ has the following 3-letter sequences:
[gca, cac, act, ctt, ttg, tgc, gca, cat, atg, tgc, gca, cac]
The function should find which of the above sequences occurs most often. In this example, ‘gca’ appears three times, ‘cac’ and ‘tgc’ appear twice each, and every other sequence appears once.
The function returns the highest occurring substring and its count.
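As a rough illustration of the idea above, here is a minimal sketch that builds the list of k-letter sequences and then picks the most frequent one. The helper names all_k_letter_sequences and most_frequent are made up for this example and are not part of the assignment. The rule that the first sequence seen wins a tie is an assumption, since the handout does not say how ties should be handled.

```python
def all_k_letter_sequences(text, k):
    """Return every k-letter substring of text, in order, duplicates included."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def most_frequent(sequences):
    """Return (sequence, count) for the most frequent entry.

    Assumption: when two sequences tie, the one seen first is kept.
    """
    best, best_count = None, 0
    for seq in sequences:
        count = sequences.count(seq)  # simple, but slow on long lists
        if count > best_count:
            best, best_count = seq, count
    return best, best_count


pieces = all_k_letter_sequences('gcacttgcatgcac', 3)
print(pieces)                 # the twelve 3-letter sequences listed above
print(most_frequent(pieces))  # ('gca', 3)
```

Building the full list first makes the intermediate result easy to inspect, which helps while debugging on short strings.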
2- Write a function that calls the above function. We will search for a sub-sequence of k letters that occurs most frequently, where k varies between min_length and max_length. For example, if min_length=4 and max_length=8, the program first finds which 4-letter sequence occurs most and with what frequency. It does the same thing for 5-, 6-, 7-, and 8-letter sequences. Then it prints, for example, ‘cttt’, 52. This means that the highest frequency of occurrence was found among the 4-letter sequences, and the highest frequencies for the other lengths were lower than 52.
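One possible shape for this second function is sketched below. To keep the example short it counts substrings with collections.Counter from the standard library, which is a different technique from the nested loops recommended in the steps later in this handout. The helper name countBestK is hypothetical, and the choice that the shorter length wins a tie is an assumption.

```python
from collections import Counter


def countBestK(dna, k):
    """Return (substring, count) for the most frequent k-letter substring."""
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    if not counts:  # k is longer than dna, so there is nothing to count
        return None, 0
    return counts.most_common(1)[0]


def mostCommonSubstring(dna, min_length, max_length):
    """Try every length from min_length to max_length and keep the best result."""
    best, best_count = None, 0
    for k in range(min_length, max_length + 1):
        candidate, count = countBestK(dna, k)
        # Assumption: strict '>' means the shorter length wins a tie.
        if count > best_count:
            best, best_count = candidate, count
    return best, best_count


print(mostCommonSubstring('gcacttgcatgcac', 3, 5))  # ('gca', 3)
```

Because overlapping occurrences are counted, a longer substring can never occur more often than its own prefix, so the winner often comes from the shortest length tried.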
Here is an example file with the first 6000 base pairs of the DNA of a virus. Calling mostCommonSubstring(dna,4,9) on that dataset should run in no more than a minute or two. That’s way too slow to use on a whole DNA sequence of millions of base pairs, but reasonable given how much Python you know now. Here’s what the output of the program would look like:
Data file: initial6000.txt
mostCommonSubstring(dna,4,9) = ‘cttt’, 52
Data file: initial6000.txt
mostCommonSubstring(dna,3,6) = ‘ttt’, 176
Data file: initial6000.txt
mostCommonSubstring(dna,5,10) = ‘gcttt’, 20
Here are the steps I recommend to finish this project. It is unlikely you can do this all in one day, so plan accordingly:
1. Create a flowchart or pseudo-code algorithm for each of the functions described in the following steps.
2. Write a program that sets a variable, dna, to a short string and also sets the variable, k, to a small integer. This program runs a loop printing out each k-length substring in dna.
3. Once that is working, modify the program to do more inside the loop. Now store the substring in a variable, target. Then make a nested loop that counts how many times the target is seen in dna. Print out that count.
4. Once that is working, modify the program to keep track of which target string produces the highest count. (Hint: use another pair of variables, highestCount and targetWithHighestCount, to keep track of the best target seen so far.) Print out targetWithHighestCount at the end of your program.
5. Now turn your working program from the step above into a function. Instead of setting dna and k upfront, make them parameters to the function. Instead of printing targetWithHighestCount, return it together with highestCount. Name your function mostCommonK.
6. Next, write a program that sets dna to a short-ish string and then calls mostCommonK repeatedly with different values for the k argument, say 2, 3, 4, 5, and 6. Now do something similar to what you did in step #4 and print out the value of the k argument that caused mostCommonK to return the largest count.
7. Now change the program in the step above into a function, mostCommonSubstring.
8. Next, write a program that does what runMostCommonSubstring needs to do. It asks the user for the name of the file and the values for mink and maxk, opens the file and reads its contents into the variable dna, then calls mostCommonSubstring(dna, mink, maxk) and prints the answer. Test this program on a file with a small dna string in it before attempting the big one. Once it works on a small file, then try the big one. (See the first bullet below for file reading instructions.)
9. Change the previous step into a function, runMostCommonSubstring. Test it. And hand it in!
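Steps 2 through 5 above could come together roughly as follows. This is only a sketch of one way to do it, using the variable names suggested in the hints. Returning the count alongside the substring is an assumption based on the earlier description of the function, and the first target seen is kept when two targets tie.

```python
def mostCommonK(dna, k):
    """Return the most frequent k-letter substring of dna and its count."""
    highestCount = 0
    targetWithHighestCount = None
    # Step 2: visit each k-length substring of dna.
    for start in range(len(dna) - k + 1):
        target = dna[start:start + k]
        # Step 3: nested loop that counts how often target is seen,
        # including overlapping occurrences.
        count = 0
        for other in range(len(dna) - k + 1):
            if dna[other:other + k] == target:
                count += 1
        # Step 4: remember the best target seen so far.
        if count > highestCount:
            highestCount = count
            targetWithHighestCount = target
    # Step 5: return the answer instead of printing it.
    return targetWithHighestCount, highestCount


print(mostCommonK('gcacttgcatgcac', 3))  # ('gca', 3)
```

The nested loop compares every position with every other position, so the work grows with the square of the string length. That is acceptable for a 6000-letter file but not for a whole genome.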
· The following code reads in a file. It assumes that the data file is in the same folder as your program and that it has a shortened name, ‘initial6000.txt’.
f = open('initial6000.txt')
dna = f.read()
You can also give the full location of the file if it is not in the same folder as your program. For example:
f = open('F:/my_homework/python_examples/initial6000.txt')
dna = f.read()
After the read, the variable dna will have the entire 6000 character string as its value.
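A slightly more defensive way to read the file is sketched here. The helper name readDna is hypothetical. Removing whitespace is an assumption that only matters if the data file contains line breaks or a trailing newline, which would otherwise end up inside dna. The demo writes a tiny temporary file so that it runs without the real data file.

```python
import os
import tempfile


def readDna(filename):
    """Read a DNA file and return its letters as one string."""
    # 'with' closes the file automatically, even if reading fails.
    with open(filename) as f:
        text = f.read()
    # Assumption: line breaks and spaces are not part of the sequence.
    return ''.join(text.split())


# Demo on a small temporary file instead of initial6000.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('atca\natga\n')
dna = readDna(tmp.name)
os.remove(tmp.name)
print(dna, len(dna))  # atcaatga 8
```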
· Make sure that you have enough comments and docstrings, including the program header and a description of each function.
Submissions:
(1) Submit a flowchart for the function mostCommonK
(2) Submit your Python source file PA2-