COMP 90016 Assignment 2

Department of Computing and Information Systems COMP 90016
Assignment 2
Release date: 29th of April 2015
Due date: 19th of May 2015 (5pm)

Introduction:

The work by Sturm et al titled “A Single SNP in an Evolutionary Conserved Region within Intron 86 of the HERC2 Gene Determines Human Blue-Brown Eye Color” (accessible online at http://www.sciencedirect.com/science/article/pii/S0002929707000407) describes that eye colour in humans can be determined with reasonable confidence by single SNPs: two highly predictive variants named rs12913832 and rs1129038 are explored in the paper. In this assignment you are going to analyse DNA sequencing data upon the status of these SNPs in multiple samples to estimate the eye colour of the sequenced individuals.

The SNPs in question are located in the HERC2 gene on the human chromosome 15. The exact locations in the human reference are chr15:28365618 for rs12913832 and chr15:28356859 for rs1129038.

Tasks:

You are going to analyse sequencing data from three different (human) individuals. The data are accessible on the LMS and in the comp90016 folder on the MSE unix servers (nutmeg, digitalis, dimefox – in /home/subjects/comp90016/assgs/assg2/). You are free to work with the data in our lab environment or download the sequencing data to your own computer. There are three sets of data each consisting of a bam file (.bam) and a bam index (.bai): sample1.bam (.bai), sample2.bam (.bai), and sample3.bam (.bai).

Task 1. Write a python program that takes following inputs from the command line:

  1. A filename (expecting aligned sequencing data in form of a bam-file).
  2. A genomic coordinate (expecting a SNP coordinate) in the form

    chr:position.

The program should use the pysam library to investigate the following properties and present them as output to the command line:

c. How many reads in the input bam file (a) overlap the genomic coordinate specified as the input value (b)?

  1. Which nucleotides (A, C, G, T, N) are supported by those reads, and how many times each?
  2. What is the strand support for each base (what proportion of reads maps to the forward strand of the reference)?

The program should perform sensible error checking regarding the existence and readability of the provided filename and the correct format for the provided genomic coordinate.
An example run for this program may (but does not have to) look like this:

 >python snp_program.py sampleX.bam chr15:28365618
    There are 4 reads overlapping the coordinate at
chr15:28365618.
    The reads support the following nucleotides:
      A: 0
      C: 2 (of these 50% map to the forward strand)
      G: 2 (of these 100% map to the forward strand)
      T: 0
      N: 0

Task 1 is worth 10 marks.

Task 2. For each data sample you are to discuss the output of your program for the two SNP locations provided in the introduction:

  1. Present a short summary of the output.
  2. Be sure to include your own inference about the SNP status

    (present/absent, heterozygous/homozygous).

  3. Argue what the most likely eye colour for each of the samples is

    (utilize Table 3 in the paper) and how confident this assertion is.

  4. Discuss why your answer in 2c. is only true with a certain probability

    (only one answer for all of the samples).

Task 2 is worth 5 marks.

Notes:

  • –  Aligned sequence data in the SAM/BAM format are always presented according to the forward strand of the reference genome. The HERC2 gene is on the reverse strand. This may affect the genotype in your results, which may have to be complemented accordingly.
  • –  The aligned sequence data is only a small subset of reads from a larger file. There is only coverage in the region of the two SNPs analysed here. Keep this in mind when testing your program on general coordinates within the genome, as there are only reads present in the region chr15:28356359- 28366118.
  • –  There are inconsistencies in the coordinate systems of different annotation data (NCBI, UCSC, GenBank, etc.) and different libraries (BioPython, Pysam, Picard, etc.). In particular, pysam works with 0-based coordinates. The coordinates given in the introduction are 1-based, and your program is expected to accept them as such.
  • –  The integrative genome viewer IGV can be helpful to look at the sample data and verify your results.
  • –  If you develop the program for task 1 one your own computer, make sure that it also runs on the MSE machines, as this will be the environment we test in.