THE UNIVERSITY OF AUCKLAND
FIRST SEMESTER, 2020 — FINAL ASSESSMENT Version 1
Campus: VIRTUAL
COMPUTER SCIENCE
Copyright By PowCoder代写 加微信 powcoder
Computational Science (Time allowed: THREE hours)
This assessment has 11 questions, and it is worth 60 marks in total.
Answer all questions in a single document and upload it as a pdf file to Canvas, as you have done for the assignments.
This test has been designed as an open-book and open-web three hour test. However you have a full 24 hours to work on it. You are allowed to use the full 24 hours.
When a question requests to explain your answer, that means you need to justify how you came up with the solution or why you made a certain choice. Be concise and clear.
The work must be your own. You are not allowed to copy text or code or solutions. Use your own words when answering the questions.
If you wish to raise concerns during the Final Assessment, please call the Contact Centre for advice: Auckland: 09 373 7513, Outside Auckland: 0800 61 62 63, Interna- tional: +64 9 373 7513
It is your responsibility to ensure your assessment is successfully submitted on time. Please dont leave it to the last minute to submit your assessment.
For any Canvas issues, please use 24/7 help on Canvas by chat or phone.
COMPSCI 369
QUESTION SHEET 2 COMPSCI 369
Section A: Integrity Statement
By completing this assessment, I agree to the following declaration:
I understand the University expects all students to complete coursework with integrity and honesty. I promise to complete all online assessment with the same academic integrity standards and values. Any identified form of poor academic practice or academic misconduct will be followed up and may result in disciplinary action.
As a member of the University’s student body, I will complete this assessment in a fair, honest, responsi- ble and trustworthy manner. This means that:
• I declare that this assessment is my own work.
• I will not seek out any unauthorised help in completing this assessment.
• I am aware the University of Auckland may use plagiarism detection tools to check my content.
• I will not discuss the content of the assessment with anyone else in any form, including Canvas, Piazza, Facebook, Twitter or any other social media or online platform within the assessment period.
• I will not reproduce the content of this assessment anywhere in any form at any time.
• I declare that I generated the calculations and data in this assessment independently, using only the
tools and resources defined for use in this assessment.
• I will not share or distribute any tools or resources I developed for completing this assessment.
Section B: Computational Biology and Numerical Integration Characteristics of systems typically studied in computational biology
1. Explain in your own words the following characteristics.
(a) heterogeneity
(b) non-linearity
(c) high-dimensionality
[1 mark] [1 mark] [1 mark]
2. Explain in your own words why computational biology often focuses on systems that have these properties. [2 marks]
3. For each of these three characteristics (heterogeneity, non-linearity and high-dimensionality), give an example of a system that has the property and an example system that does not. Justify each example in 1–2 sentences. The examples should be your own (i.e., they should not come from the internet or from the course slides). They can come from biology, but they don’t have to. [6 marks]
QUESTION SHEET 3 COMPSCI 369
Section C: Genomics, Sequence Alignments and Assembly
4. Assume that you are given a DNA of a gene with no splice variants (no introns), as represented by thefollowingsequence5’ TTATTTCATTT 3’
(a) Illustrate the main information flow of the Central Dogma of Molecular Biology on the given DNA sequence [1 mark]
(b) How many open reading frames do you have to consider? Please provide all of them and explain how you obtained them. [2 marks]
(c) Which one codes for a protein and explain why. What is the sequence of that protein? Please explain how you obtained the protein sequence. [1 mark]
5. In the second assignment you implemented an algorithm for global and local alignment separately.
(a) Now imagine you are only given the function global() that solves the global alignment problem (returns the optimal score and the optimal alignment for a given pair of sequences and given substitution matrix). Provide a pseudo code that will solve the local alignment problem using the global() function, while not modifying global(). Please provide meaningful pseudo-code comments and text-based explanation of the main idea.
(b) What will be the time complexity of this algorithm? Is it more efficient than the local align-
ment method we introduced in the lecture? Explain your answers.
6. You are given the following set of 3-mers {GGG, ATG, GGT, GTG} and you want to recon- struct the DNA sequence that can be described by this 3-mer composition. For this purpose, we introduced two graph-based approaches that utilise paths in graphs. For each approach show step- by-step how it can be applied to solve this sequence reconstruction problem.
In particular, explain
(a) what the graph (nodes and edges) represent, [1 mark; 0.5 marks per approach]
(b) how you constructed the graph, [3 marks; 1.5 marks per approach]
(c) how you obtained the sequence from the graph, and show all the sequences you could reconstruct for the problem at hand. [2 marks; 1 mark per approach]
QUESTION SHEET 4 COMPSCI 369
Section D: Simulation, HMMs and Trees
What is the distribution of the random variable that the randomVar pseudo-code below generates given that randomUniform produces a uniformly distributed random variable between 0 and 1? Explain your answer.
(b) Write a pseudo-code method biasedRandWalk(k, p) that simulates a random walk of length k starting at 0 where steps of +1 occur with probability p and steps of -1 occur with probability 1-p. The method returns an array showing the position of the walker after each of the k steps. Explain how you would use this method to estimate the probability that the walker is in a position ≤ −100 after 2000 steps for various values of p. [2 marks]
Certain regions of a DNA sequence close to regions that code for genes are richer than expected in the bases C and G. Suppose we model regions of the sequence as being either rich or poor in Cs and Gs. We use an HMM with two states, P for poor, R for rich. The probability that state P is followed by state R is 1%. The probability that state R is followed by state P is 2%. The process starts in either state with equal probability.
Each state can emit any of the 4 bases A,C,G or T. Emission probabilities are given by
ACGT P 0.3 0.2 0.2 0.3 R 0.15 0.35 0.35 0.15
(a) Sketch a diagram of the HMM, showing all states, possible transitions and transition proba- bilities. Include the begin state. [2 marks]
(b) What is the joint probability P(x,π) of the state sequence π = PPRR and the symbol se- quence x = TCCG? Leave your answer as a product or sum of numbers. [1 mark]
(c) One version of the Baum-Welch algorithm uses the Viterbi algorithm. Describe what the Baum-Welch algorithm does and how the Viterbi algorithm is used within it. What would the updated parameter estimates be for x = TCCG and π∗ = PPRR? Explain why you use pseudo-counts in this procedure. [3 marks]
(d) HowsuccessfulwouldyouexpecttheBaum-Welchalgorithmtobeforthismodelforagiven symbol sequence of length 1000? Give reasons for your answer and describe one model or data change which would improve its success? [1 mark]
randomVar(rate, n)
for i in 1:4
var = var – log(1-randomUniform())/rate
return var
QUESTION SHEET 5 COMPSCI 369
(a) (b) (c)
(e) 10. (a)
(c) 11. (a)
Construct a UPGMA tree from D showing your working.
Are the distances in D ultrametric? Explain why or why not.
(e) Complete the four missing entries in the Viterbi matrix below and draw in the traceback pointers. Remember to show your working.
9. Let the symmetric matrix
D=420
[2 marks] [1 mark]
specify the pairwise distances, Dij , between the four sequences x1 , . . . , x4 .
What biological assumptions are we making by assuming that UPGMA will construct the
correct tree and how realistic are they in general? [2 marks]
Perform the first step in the neighbour-joining algorithm on D (up to where the first pair of neighbours i, j are joined and the distances of i and j to the new node are calculated). [2 marks]
Will neighbour-joining reconstruct the correct tree in this case? When would we prefer neighbour-joining over UPGMA? [1 mark]
Draw a 5-taxon unrooted tree with taxon labels a, b, c, d, e and make up an alignment with se- quence length 4 that would support the tree topology that you show. Show on which branches mutations would occur to give the parsimonious explanation for the alignment. Which posi- tions in the alignment are parsimony informative and which are not (make sure that there is at least one of each type)? [3 marks]
What is the parsimony score for the tree and alignment in part (a)? Draw another tree for the same alignment that has a different parsimony score and write down the score for this tree. [1 mark]
Describe why finding the maximum parsimony tree is hard and why, even if it were easy to find it, it may not be very helpful. [2 marks]
What are the advantages of using a likelihood based method (whether it be maximum likeli- hood or a Bayesian approach) to reconstruct a tree from a given sequence alignment D over parsimony or distance based methods? [2 marks]
What are the disadvantages of a likelihood based approach to tree reconstruction? [2 marks]
ForagivenalignmentD,treegandmutationparametersθ,howdowefactorisethelikelihood P (D|g, μ) to make its calculation tractable? Explain which assumptions of independence are being made at each step. [3 marks]
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com