CSC 209 (http://www.cdf.toronto.edu/~csc209h/summer/) CONTENT (http://www.cdf.toronto.edu/~csc209h/summer/content/)
ASSIGNMENTS (http://www.cdf.toronto.edu/~csc209h/summer/assignments/)
ONLINE LABS (http://www.cdf.toronto.edu/~csc209h/summer/labs/) DEADLINES (http://www.cdf.toronto.edu/~csc209h/summer/deadlines/)
LINKS
Assignment 1: Using the Shell and Intro C
Follow the instructions carefully, so that we receive your work correctly.
Your first step should be to log into MarkUs (https://markus.cdf.toronto.edu/csc209-2016- 05) using your cdf id and password. Select the assignment A1: Using the Shell and Intro C. You will be working alone on this assignment, and the URL for your SVN repository can be found on the right hand side of the page. Use that URL to check out your repository. You will find an A1 directory into which you should put your work.
It is important that you follow the instructions carefully, so that your work is correctly recorded. Remember, you must do all of the
steps on a CDF machine from your own account. You can do it from home if you are logged into a CDF server (using NX, Putty, ssh).
1. Checkout your svn repository
Create a directory in your home directory called csc209. Change into that directory and check out your SVN repo using the repository link from MarkUs.
2. Copy input files
Input files for all tasks are in this tar archive: A1_inputs.tar (A1_inputs.tar). Copy this file into your A1 directory, and untar it. You should see the following input files: ca-500.csv
3. Test svn
Create file readme.txt with a single line: “A1”. Add this file to your svn repository using
svn add readme.txt . Then commit it using
svn commit -m ‘First commit’ . Once you have committed files to your repository, you can use MarkUs to make sure that what you committed is what we got. The Submissions tab for the assignment will show you what is in the repository, and will let you know if you named the files correctly. You will not be able to submit files through the MarkUs web interface for assignments.
If you need more information about svn, you
can find it in this tutorial (http://maverick.inria.fr/~Xavier.Decoret/resources/svn/).
PART 1. USING SHELL
COMMANDS
In this part we are going to use existing command-line tools to extract some meaningful information from files. To accomplish that you would need to review some shell commands. We have learned most of them in class, but you may need to find out some additional options, which you can do by reading the relevant man pages. At first, you may find reading man pages difficult because they are written for a technical audience, but it will pay off in the future because you will be able to access the complete and correct information about every command.
Hint: commands for part 1
head : extracts first n rows from a text file. Default value is 10 rows, find out how to specify the number of rows as a parameter.
tail : same as head only for last n rows.
cut cut: cuts out a number of columns from a tabular file. We need to know how to specify column numbers and how to specify a field delimiter.
grep : we need to know how to match the entire word
sort : we need to know what parameters we need to perform numeric sort, sorting by fields and specifying field delimiters, sorting more than one file.
uniq : used for both removing and finding duplicate lines in a file.
wc : we are interested in the option which
count lines
We can chain commands together by piping an output of one command as an input to another by using operator | . Using only limited set of simple commands listed above we can explore the content of files. Complete the following tasks with a single-line sequence of commands.
Task 1. Extracting information from tables
We are going to work with file ca-500.csv, which is a sample address book with 500 Canadian contacts. The first row contains the names of each column. Each line except the first contains an anonymized contact information of a single person, and the different fields are separated by commas.
Write a single-line sequence of commands to:
1.1. Sort all rows by last name and then by first name (columns 2 and 1). This means that if there are ties in the last names, the order is determined by the first name. Redirect output to file q1_addresses_sorted. (Note: the output files have no extension). 1.2. Count number of contacts in Calgary. Redirect output to file q1_calgary.
1.3. Find out how many different unique cities are in this address book. Output this number into file q1_different_cities_count. 1.4. Print the list of all unique pairs of city – province. Redirect output to file
q1_cities_provinces. Submit:
The list of all commands in flat text file q1_ commands, and all output files from tasks 1.1 – 1.4: q1_addresses_sorted, q1_calgary, q1_different_cities_count, q1_cities_provinces. File q1_ commands should contain exactly 4 lines, one line per task, with all the pipes, parameters and file redirections that you used for achieving each result.
Task 2. Manipulating files with numbers
You are given two files A and B, where each line contains exactly one integer. Write a single-line chain of commands to complete each task listed below. Do not forget that you are dealing with NUMBERS now, rather than strings.
1.1. Find out how many times the number 1 occurs in file A. Do the same for file B. (1 command, 2 results).
1.2. Find the minimum and the maximum number in file A. Do the same for file B. (2 commands, 4 results)
1.3. Combine numbers in files A and B into a single list which contains all the numbers both from A and B but without duplicates. Redirect results to file UNION (1 command, 1
UPDATE:The new input files for this part, without junk line-endings are in AB.tar (AB.tar).
output).
1.4. Find out what set of numbers is common to both A and B . Redirect results to file INTERSECTION (1 command, 1 output).
Submit:
The list of all commands in flat text file q2_ commands (5 commands), results for questions 2.1 – 2.2 in file q2_ results (6 lines of results), and 2 output files: UNION and INTERSECTION.
PART 2. INTRO C
In this part you will develop a useful command- line tool for searching genomes demonstrating your mastery of loops, conditionals, arrays, and functions.
CpG Islands
Living organisms are defined by their genome, which we represent as strings constructed using a 4-letter alphabet {‘A’, ‘C’, ‘G’, ‘T’}. Only a small number of sequences in a genome encode for proteins, these are called genes. One way to zero-in on genes is to find stretches in the genome rich in CG sub-stings. These CG- rich regions are often referred to as CpG Islands. Your goal is to locate CpG islands in a given DNA string.
The CpG island can be formally defined as a region of at least 200 chars long, where both following conditions hold:
1. Total count of Cs + Gs is greater than 50%
of the site length.
2. The ratio of observed CG substrings to expected is at least 60%.
You will declare and define separate functions to compute these two values. The expected number of CG patterns is calculated for each region as (count (‘C’) * count (‘G’))/N, where N is the total number of characters in the site considered. The observed is just total count of CGs: count (‘CG’). So to find the observed to expected ratio we divide observed by expected.
Your goal is to find and report all CpG islands in a given DNA sequence. Sample file dna.txt represents a short DNA sequence from a genome of Norway rat.
UPDATE: Because it turned out that almoust entire input sequence in dna.txt comes from a CpG island, to ensure that your program handles all different usecases, please also test it on the DNA sequence of Human chromosome 21 which you can download from here (chr21.fa.gz). You do not need to modify your code, just use a different input file. If you also count the total number of characters in all CpG islands, you will see that CpG islands cover only a small portion of a genome.
The input is given in the widely used FASTA format, and the first step is to convert the input into a char array. As we did not learn how to read files yet, the code for reading all bytes from a file and converting them into an array of valid DNA characters is provided.
Once you have your character array, you can process each window of length 200 characters to determine whether it is a part of CpG island.
You slide the window by one character at each iteration. To simplify the task, and because your sequence is small, we will use the following algorithm.
for each window of length 200, calculate if it belongs to a CpG island
if it does, set the corresponding 200 entries in array valid_cpg_positions to 1
when the entire sequence has been processed, combine valid CpG sites into larger islands looping over
valid_cpg_positions
print the start and end position of each CpG island that you discovered
Sample output:
The output consists of the table in the following format:
CpG island
CpG island
numislands
104..509 size=406
596..924 size=329
2
Each column is separated by a tab
Submit:
Source code in file cpg.c and the output of your program in file cpg_output.txt.
Starter code
#include
//function declarations
int valid_chars_count (char *input, int size);
)){
int valid_char (char c);
int main (int argc, char *argv[]) {
char * inputfile_name=argv[1];
FILE * input_fp;
char *input_buffer; //stores all bytes in the i
nput file
char *sequence; //char array holding the fin
al cleaned DNA sequence
int *valid_cpg_positions; //array of all positi ons in sequence
int input_len, seq_len; int i, j;
int r;
//open the file
if ( !(input_fp= fopen ( inputfile_name , “rb” )
printf(“Could not open file \”%s\” for r eading input lines \n”, inputfile_name);
return (-1); }
//compute total number of characters in the input file
fseek (input_fp, 0, SEEK_END); input_len=ftell (input_fp);
rewind (input_fp);
printf (“total bytes in file: %d\n”, input_len);
//allocate an array to hold these characters
input_buffer = (char*) malloc (input_len+1);
//read file content into the input buffer
r = fread (input_buffer, 1, input_len, input_f
p);
printf (“read characters %d\n”, r); if (r != input_len) {
printf(“Reading error \n”);
return (-1); }
//add terminating character
input_buffer [input_len] =’\0′;
//determine total number of valid DNA char acters
//and allocate array of chars to hold them
seq_len = valid_chars_count (input_buffer, i nput_len);
printf (“total characters %d total valid char acters %d \n”, input_len, seq_len);
sequence = (char*) malloc (seq_len+1);
//transfer valid characters from raw buffer
for (i=0, j=0; i < input_len; i++) {
if (valid_char (input_buffer [i])) {
sequence [j++] = input_buffer [i]; }
}
sequence [seq_len] = '\0';
//allocate int array for all the positions
valid_cpg_positions = (int*) malloc (seq_len *sizeof(int));
for (i=0; i