程序代写代做代考 python DNA Quality filtering fastq files with Python

Quality filtering fastq files with Python
Introduction
This week’s task will be centered on (part) of the same sea turtle ddRADseq data as last week. However, in this case you will use Python and relevant modules to filter out low quality data, and then demultiplex the remaining data into the read pairs for each sample ID. One key aspect is to keep read pairs (R1 and R2 reads) aligned, i.e., that the R1 and R2 reads are in the same order in both the R1 and R2 fastq file.
Tasks
In this assignment, use Python along with the modules SeqIO from Biopython, numpy and re to clean the fastq data on a per read pair basis. In other words, in the final, cleaned fastq files for each sample should be two fastq files named [sample_name]_R1_fastq, and [sample_name]_R2_fastq. The read pair order must match between R1 and R2, i.e., both R1 and R2 fastq files must contain exactly the same reads and in the same order (the colored number below denotes the read number).

In addition to the above (detailed below) you also need to organize your project in a project directory with a logic structure (i.e., files and folders, a README file that explains etc.) which each contains data/scripts/output etc.
Figure 1. A R1-2 read pair (same read number)
@D00230L:402:H37WYBCXY:1:1106:9962:59589 1:N:0:ACAGTG
AGTGTTAAAGCTTATAAAGGCAATATGCCTATGCTTATGGTAAAAAACCTTAAGACAGTCTGATAAATTTAGCTTCATTTATGCATCTACATGTTTGGAAC
+
DDDDDIIIIII@HHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIIIIIHIIIIIIIIIIIIIIIIHIIIIHH

@D00230L:402:H37WYBCXY:1:1106:9962:59589 2:N:0:ACAGTG
CGGACATTGCTATTCTGATTCACTCTTCACTTACCCTCCATTTGTCCCATGCATAATGTTTCCTTTTTCCAGCATCAACTCTGTGACAAGATCAAAAATGG
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIGIIIIHIHHI
Notes: An example of two fastq records from a read pair. The orange colored numbers are (here 59589) the read index (i.e., same number, same read pair). The green numbers denote read R1 (1) or R2 (2). In our case we used the restriction endo-nucleases HindIII (in yellow) and mspI (in blue), for R1 and R2 respectively.
Filtering and filtering requirements
Remove both reads from each read pair where;
• Either the R1 or R2 read have a mean quality score (phred quality) below 30 in the first or last 10 bases in either R1 or R2. The positions we refer to are underlined in Fig. 1.
• The R1 restriction endonuclease recognition site DNA sequence is malformed, i.e. not in the expected region (yellow Fig. 1, pay attention to the lengths of barcodes used in these data)
• Contains any uncalled bases (‘N’)
• Shorter than the expected 101 base pairs long
The script should report the following:
• The number of individual R1 and R2 reads that passed and failed filtering, as well
• The number of read pairs that was removed and kept

Write the filtered paired fastq records to a new fastq file called cmydas_1m_R1_cleaned.fastq and cmydas_1m_R2_cleaned.fastq, respectively.
Demultiplexing data
Using the filtered fastq files (i.e., cmydas_1m_R1_cleaned.fastq and cmydas_1m_R2_cleaned.fastq) demultiplex the reads (keeping the read pairs aligned) based on the expected barcodes. The final fastq files should be named as follows _R1.fastq and _R2.fastq. For instance, if the sample ID is BO150013 then the file names should be BO150013_R1.fastq and BO150013_R2.fastq, respectively.
The script should report the following for each sample ID:
• the number of read pairs, as well as,
• the mean quality values per read for all reads from that sample ID
• A text file with the first 10 and last 10 fastq records

In a [text] file named _stats and be located in a logic place in your project structure.
Extra bonus
Produce a log file (i.e., with the .log extension) with the number of read pairs without an expected barcode (i.e., barcodes not in Table 1). The output should list the number of reads per “unexpected” barcode along with the sequence of the unexpected barcode.
Table 1. Sample IDs and barcodes.

Sample ID
Barcode (P1)
BO150013
GCATG
BO150077
AAGTGA
BO150087
ATTACA
BO150098
AGAATGA
BO150106
AGTTAAT
BO150125
AGTGTTAA
BO150164
CACGACCA

Reporting
The report should be a “proper” report, with a title, name etc. and only with your own writing. It needs an introduction, presenting the overall goal as well as brief explanations of the individual steps in your python script (which itself should be richly commented). In addition, the overall file and directory structure of your project needs to be described and documented in a figure. NO SCREENSHOTS in report, appendices – NONE, nada, zilch! Use different font typefaces to make it intuitively clear in your report when you refer to variables, objects, directories etc. The final, fully commented, Python script should be supplied in a separate .py file, which can be executed without any adjustments (apart from setting up your required project directory structure, file locations etc.). In summary, the report should be complete, and with no “lazy” shortcuts, such as screen shots.

Related Posts