3/5/2021 ASSIGNMENT #3 – Working with FASTA data
It’s OK to use the re and os modules.
ASSIGNMENT #3 Working with FASTA data
What your programs MUST do:
You must adhere to the programming specification for this assignment in order to receive full credit. You must also use command line options for these programs, so pay particular attention to the example usage I have provided. See the solution to assignment 2 for more information on how to implement command line options. You should use argparse. Command line options automatically put checks into place. For example, when an option should be an integer (type=int), argparse checks this for you:
import argparse
parser = argparse.ArgumentParser(description='Find Descriptive Statistics for a column in the file')
parser.add_argument('-c', '--column', dest='column_to_parse',
                    type=int, help='Column to parse in the file to open', required=True)
1. The data file for this program can be found here (Note this file has been gzipped – Right-click, “Save Link As” – and make sure to gunzip before using).
Suppose we had a company identify all of the amino acids for a group of unknown proteins using Mass Spectrometry (De novo peptide sequencing for mass spectrometry is typically performed without prior knowledge of the amino acid sequence. It is the process of assigning amino acids from peptide fragment masses of a protein) and we also had them computationally determine the secondary structure of those peptide chains.
We had asked them to send us two files. One fasta file with the protein sequences, and the other with the corresponding secondary structure data. But as it turns out they sent us one file, Arghhhhhh!! You are to write a program (pdb_fasta_splitter.py) which will open this file and generate two files. One with the corresponding protein sequence (pdb_protein.fasta), and the other with the corresponding secondary structures (pdb_ss.fasta). Make sure to keep the white spaces intact in pdb_ss.fasta because the position corresponds to the amino acid in pdb_protein.fasta. At the end of the program tell the user how many sequences were found for each of the output files, by printing this out to STDERR.
Here is an example of what the first two sequences look like: Gaps in the secondary structure just mean there is no secondary structure annotation.
>101M:A:sequence
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGA
ILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG
>101M:A:secstr
HHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH GGGGGG TTTTT SHHHHHH HHHHHHHHHHHHHHHH
HHTTTT HHHHHHHHHHHHHTS HHHHHHHHHHHHHHHHHH GGG SHHHHHHHHHHHHHHHHHHHHHHHHT
T
pseudocode for your program
get file handle to sequences in the fasta file (use the get_fh function, see below).
open two outfiles named: pdb_protein.fasta and pdb_ss.fasta (use the get_fh function, see below).
loop over fasta filehandle and store the data in two lists, one for the header line and the other for the sequence data (use the get_header_and_sequence_lists function, see below).
process the two lists: go over the header list, and if it matches the amino acid sequence print it to pdb_protein.fasta, otherwise print it to pdb_ss.fasta.
You can hard-code pdb_protein.fasta and pdb_ss.fasta file names into your pdb_fasta_splitter.py
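The last pseudocode step could be sketched roughly as follows. This is only an illustration under stated assumptions, not the required solution: the function name split_records is hypothetical (the assignment does not name this step), and it assumes a protein header ends in ":sequence", as in the 101M example above, with every other header treated as secondary structure.

```python
import io
import sys


def split_records(list_headers, list_seqs, fh_protein, fh_ss):
    """Write each record to the protein or ss handle; return both counts."""
    num_protein = 0
    num_ss = 0
    for header, seq in zip(list_headers, list_seqs):
        # Assumption: protein headers end in ":sequence" (>101M:A:sequence).
        if header.endswith(":sequence"):
            fh_protein.write(f"{header}\n{seq}\n")
            num_protein += 1
        else:
            # The sequence is written untouched, so spaces in the
            # secondary structure keep their positions.
            fh_ss.write(f"{header}\n{seq}\n")
            num_ss += 1
    return num_protein, num_ss


if __name__ == "__main__":
    # In-memory handles stand in for pdb_protein.fasta and pdb_ss.fasta.
    demo_protein, demo_ss = io.StringIO(), io.StringIO()
    found_protein, found_ss = split_records(
        [">101M:A:sequence", ">101M:A:secstr"],
        ["MVLSEGEWQL", "    HHHHHH"],
        demo_protein, demo_ss)
    # The counts go to STDERR, as the assignment asks.
    print(f"Found {found_protein} protein sequences", file=sys.stderr)
    print(f"Found {found_ss} ss sequences", file=sys.stderr)
```

Returning the two counts keeps the printing out of the helper, which makes the function easy to check in a unit test with in-memory handles.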
Examples of the program being run (Follow the same output):
1. $ python3 pdb_fasta_splitter.py --infile ss.txt
Found XX protein sequences
Found XX ss sequences
# Comment only Note: the XX above in the output should be the actual number of sequences found
2. $ python3 pdb_fasta_splitter.py -h
usage: pdb_fasta_splitter.py [-h] -i INFILE
Give the fasta sequence file name to do the splitting
optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Path to the file to open
3. $ python3 pdb_fasta_splitter.py
usage: pdb_fasta_splitter.py [-h] -i INFILE
pdb_fasta_splitter.py: error: the following arguments are required: -i/--infile
155.33.203.122/teaching/BINF6200-Spring2021/local/assignments/assign_fasta.html 1/5

4. $ python3 pdb_fasta_splitter.py --infile ss_designed2Fail.txt
The size of the sequences and the header lists is different
Are you sure the FASTA is in correct format
Use the following file to test if the program exits correctly. Note, there should be no output files created. See assignment 2’s solution for more help on unit testing.
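The usage messages in the examples above can be produced with a small argparse setup. The sketch below is one possible arrangement: the function name get_cli_args and dest='infile' are illustrative choices, not names required by the assignment, while the description and help strings are copied from the example output.

```python
import argparse


def get_cli_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Give the fasta sequence file name to do the splitting')
    parser.add_argument('-i', '--infile', dest='infile', type=str,
                        help='Path to the file to open', required=True)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # An explicit list is passed here so the demo does not depend on how
    # the script was launched; a real main would call get_cli_args().
    args = get_cli_args(['--infile', 'ss.txt'])
    print(args.infile)
```

Because the option is declared with required=True, argparse itself prints the usage line and the "the following arguments are required" error, then exits, when -i/--infile is missing.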
You must implement the following functions (Name them exactly as instructed, and provide the same arguments and call them in the same context as instructed. Failure to do so will result in points being deducted):
1. Write a function (call it get_fh) that receives two arguments: 1). A file name 2). How to open the file, for reading or writing ("r" or "w") in Python. The purpose of this function is to open the file name passed in and pass back a file object, aka the file handle or handle. You can call this function like this:
fh_in = get_fh(file_2_open, "r")
or
fh_out = get_fh(file_2_write, "w")
When using open(), make sure to use try .. except .. except. If the open was not successful, the program should raise the right exception: an IOError when the file cannot be opened, and a ValueError when the wrong argument was passed for the opening mode, e.g. "rrr" instead of "r". We will test for things like a file that does not exist for opening, or the wrong open mode, e.g. mode='rrr'. See open for more information. All opening and closing of files in your program should use the get_fh function. Failure to do so will lose points. Make sure to close your file handle.
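A minimal sketch of get_fh under these requirements might look like the following. The wording of the error messages is an assumption (the assignment does not prescribe it); the try .. except .. except shape and the two exception types come from the description above.

```python
def get_fh(file_name, mode):
    """Open file_name with the given mode and return the file handle."""
    try:
        return open(file_name, mode)
    except IOError as err:
        # Raised, for example, when a file opened with "r" does not exist.
        raise IOError(f"Could not open the file: {file_name} "
                      f"for mode '{mode}'") from err
    except ValueError as err:
        # Raised by open() for an invalid mode string such as "rrr".
        raise ValueError(f"Could not open the file: {file_name} "
                         f"for mode '{mode}'") from err
```

The caller stays responsible for closing the returned handle, for example with fh_in.close() once the data has been read.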
2. Write a function (call it get_header_and_sequence_lists) that receives one argument: 1). A file handle to the fasta file used in this program. The function will return two lists: one list for the headers to the sequences in the file and one list for the sequences in the file. There should be a one-to-one correspondence between the data in the lists, meaning element 1 of the header list should correspond to element 1 of the sequence list. If implemented correctly, you can call this function like this:
# send off the filehandle
# get back data in lists
list_headers, list_seqs = get_header_and_sequence_lists(fh_in)
This function should sys.exit("The size of the sequence list…") if it could not successfully get two lists of equal size (see _check_size_of_lists below). The function _check_size_of_lists does the actual exiting, but it is called by the get_header_and_sequence_lists function. This function can be tested by a unit test.
Make sure “newline \n” characters have been removed from your sequence data. Note, remove newline characters, not spaces, since the secondary structure string might contain spaces!
3. Write a function (call it _check_size_of_lists [note the “_” in front of the name of the function]) that receives two arguments: 1). The header list found in step 2 directly above. 2). The sequence list found in step 2 directly above. This is a helper function that will be called in the get_header_and_sequence_lists function. That is why its name starts with a “_”: it is not meant to be called in the main part of your program, i.e. a single leading underscore marks a name as intended for internal use only. If the sizes of the lists passed into this function are not the same, it should exit (telling the user why it exited); otherwise it returns True.
If implemented correctly, you can call this function like this:
# check to make sure data looks good. Note, the return value is not used, so no assignment
_check_size_of_lists(list_headers, list_seqs)
2. In your second program (nucleotide_statistics_from_fasta.py) you are to open the following file (Note this file has been gzipped, Right-click, “Save Link As” – and make sure to gunzip before using). Store the data in two lists like we did before in step 1. Now for each sequence I would like to know the number of A’s, T’s, G’s, C’s, and any other type of nucleotide; call all of these N’s. I’d also like to know the length of the sequence and the %GC content of the entire sequence.
pseudocode for your program
get file handle to sequences in the fasta file (use get_fh function; the name of the input file will come from a command line option)
open one outfile (use get_fh function; the name of the output file will come from a command line option)
loop over fasta filehandle and store the data in two lists, one for the header line and the other for the sequence data (use get_header_and_sequence_lists function, see below).
process the lists, and determine the necessary output I want to see (use print_sequence_stats function, see below).
Examples of the program being run:
1. $ python3 nucleotide_statistics_from_fasta.py --infile influenza.fasta --outfile influenza.stats.txt
2. $ python3 nucleotide_statistics_from_fasta.py -h
usage: nucleotide_statistics_from_fasta.py [-h] -i INFILE -o OUTFILE
Give the fasta sequence file name to get the nucleotide statistics
optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Path to the file to open
-o OUTFILE, --outfile OUTFILE
Path to the file to write
3. $ python3 nucleotide_statistics_from_fasta.py
usage: nucleotide_statistics_from_fasta.py [-h] -i INFILE -o OUTFILE
nucleotide_statistics_from_fasta.py: error: the following arguments are required: -i/--infile, -o/--outfile
4. $ python3 nucleotide_statistics_from_fasta.py --infile influenza.fasta
usage: nucleotide_statistics_from_fasta.py [-h] -i INFILE -o OUTFILE
nucleotide_statistics_from_fasta.py: error: the following arguments are required: -o/--outfile
You must implement the following functions (Name them exactly as instructed, and provide the same arguments and call them in the same context as instructed. Failure to do so will result in loss of points):
1. Write a function (call it get_fh, and implement it as the same as above)
2. Write a function (call it get_header_and_sequence_lists, and implement it as the same as above)
3. Write a function (call it _check_size_of_lists, and implement it as the same as above)
4. Write a function (call it print_sequence_stats) that receives three arguments. 1). The header list found in step 2 directly above. 2). The sequence list found in step 2 directly above. 3). The output filehandle the stats will be written to. This is the main function of this program, since it will print the top line of the output (see below), and each sequence’s numerical values. It will call two helper functions (_get_accession and _get_nt_occurrence, see below) for each sequence prior to printing that sequence’s data. I can call this function like this:
# send off the lists
# process the sequences and print out the data
print_sequence_stats(list_headers, list_seqs, fh_out)
5. Write a function (call it _get_nt_occurrence) that receives two arguments. 1). The character to find the occurrence of in the dna sequence. 2). The sequence data for that entry (string). I can call this function like this:
num_As = _get_nt_occurrence('A', seq)
num_Cs = _get_nt_occurrence('C', seq)
…
The function should only take A, G, C, T, or N. If any other character is given, it should exit the program with sys.exit("Did not code this condition").
6. Write a function (call it _get_accession) that receives one argument: 1). A string that is the header to the sequence. It returns the accession number. I can call this function like this:
accession_string = _get_accession(header_string)
The tab delimited output file named by the command line option should look like this (1 decimal point):
Number  Accession  A’s  G’s  C’s  T’s  N’s  Length  GC%
1       EU521893   20   20   20   20   0    80      50.0
The numbers above are not accurate, and the Number column is just incremented with each new sequence. The 1st header line in the sequence input file should have EU521893 in the output file. Also, the white space above represents a single tab between each value; this way you can easily open it in Excel!
If you implemented these programs correctly, you should end up with a very short main (4-5 lines of code). This does not include the command line options, checking the options, closing the filehandles, or comments.
3. Implement test scripts with your programs. You should name these test_nucleotide_statistics_from_fasta.py and test_pdb_fasta_splitter.py. You must get coverage up to 20-30% to receive points for this component.
Here is an example of how to test for IOError. It tests whether your get_fh raises an IOError when it should:
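import pytest
# assumes get_fh is defined in pdb_fasta_splitter.py
from pdb_fasta_splitter import get_fh


def test_get_fh_4_IOError():
    # does it raise IOError for a file that does not exist
    with pytest.raises(IOError):
        get_fh("does_not_exist.txt", "r")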
Remember, you’re just testing that the functions you wrote do what is expected of them. This will increase your coverage.
Also create a .coveragerc file in your assignment3 directory (Make sure to update what you have for your path to Python)
[run]
omit =
    test_*
    /usr/local/lib/python3.*/*
To see the HTML report, run the following (note: on Defiance you need to call pytest like /usr/local/bin/py_backup/pytest):
$ pytest --cov-report html --cov --cov-config=.coveragerc # see assignment 2’s solution for more on testing
Please make sure to review the HTML coverage! This is very helpful in showing you where your code has and has not been tested. See solution #2 coverage html for an example: follow the link to descriptive_statistics_py.html, and see the “missing” tab’s colors.
To not see the html e.g. when on Defiance, do (Note the TOTAL coverage of 79%):
$ pytest --cov --cov-config=.coveragerc # see assignment 2 solutions for more on testing
plugins: cov-2.8.1
collected 18 items
test_nucleotide_statistics_from_fasta.py ..........   [ 55%]
test_pdb_fasta_splitter.py ........   [100%]
---------- coverage: platform darwin, python 3.6.0-final-0 -----------
Name                                    Stmts   Miss  Cover
-----------------------------------------------------------
nucleotide_statistics_from_fasta.py        81     11    86%
pdb_fasta_splitter.py                      65     19    71%
-----------------------------------------------------------
TOTAL                                     146     30    79%
If, when you run pytest, you see other files included in your coverage calculation (see below):
========================================================================================== test session starts ==================
platform linux -- Python 3.6.8, pytest-5.3.2, py-1.8.1, pluggy-0.13.1
rootdir: /home/cleslin/assignment3
plugins: cov-2.10.1
collected 18 items
test_nucleotide_statistics_from_fasta.py ..........
test_pdb_fasta_splitter.py ........
----------- coverage: platform linux, python 3.6.8-final-0 -----------
Name                                                                  Stmts   Miss  Cover
-----------------------------------------------------------------------------------------
nucleotide_statistics_from_fasta.py                                      81     11    86%
pdb_fasta_splitter.py                                                    65     19    71%
/usr/local/lib/python3.6/site-packages/_pytest/_argcomplete.py           34     33     3%
/usr/local/lib/python3.6/site-packages/_pytest/_code/code.py            632    616     3%
/usr/local/lib/python3.6/site-packages/_pytest/_code/source.py          225    218     3%
/usr/local/lib/python3.6/site-packages/_pytest/assertion/__init__.py     71     62    13%
/usr/local/lib/python3.6/site-packages/_pytest/assertion/rewrite.py     590    515    13%
/usr/local/lib/python3.6/site-packages/_pytest/cacheprovider.py         243    204    16%
/usr/local/lib/python3.6/site-packages/_pytest/capture.py               458    385    16%
/usr/local/lib/python3.6/site-packages/_pytest/compat.py                168    129    23%
/usr/local/lib/python3.6/site-packages/_pytest/config/__init__.py       652    552    15%
/usr/local/lib/python3.6/site-packages/_pytest/config/argparsing.py     238    192    19%
/usr/local/lib/python3.6/site-packages/_pytest/debugging.py             191    182     5%
.. .. ..
You’ll need to add the path to your version of Python to your .coveragerc like I did up above.
Notes:
Use the above pseudocode for your algorithm development
Go through the steps of software development before you begin any implementation
Remember to have all function names required, using the exact spelling
All functions need to have a docstring
Cut and paste the commands above and make sure your code works on your local environment and Defiance
Pay close attention to the data files used and the output above
Your code must pass pylint and flake8. See lecture 02 and assignment 2’s solution for more information. See the shell script from the solution to assignment 2 (run_lints.sh)
Create a .coveragerc file in the working directory like I showed you above
Get your coverage up to 20-30% for your test programs: test_nucleotide_statistics_from_fasta.py and test_pdb_fasta_splitter.py
pytest --cov-report html --cov --cov-config=.coveragerc # see assignment #2 solutions for more on testing, and the .coveragerc file
You will have to use the try / except we learned about in class (Lecture 3)
Make sure to put the homework in /home/your_user_name/programming6200/assignment3/
Make sure you do this with the same case, or you will lose points
Use the following file to test if the program exits correctly. Note, there should be no output files created. See assignment 2’s solution for more help on unit testing
DO NOT INCLUDE INPUT FILES IN YOUR SUBMISSION
Please submit your program(s) for grading by March 7th by 12:00 PM, (Noon), follow the same submission requirements found here.