QUESTION/ANSWER SHEET CompSci 369
Attempt all questions
THE UNIVERSITY OF AUCKLAND
FIRST SEMESTER, 2019 Campus: City
Copyright By PowCoder代写 加微信 powcoder
Computer Science
Computational Biology (Time allowed: THREE hours)
Use of calculators is NOT permitted.
Put your answers in the answer boxes provided below each question.
You may use the blank pages at the end of the exam script for scratch work, which will not be marked.
Section: Possible marks:
Total 18 28 14 60
Awarded marks:
SURNAME: FIRSTNAME: ID:
Page 1 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
Section A: Sampling and Markov models (18 marks)
1. What does it mean for a sequence of random variables X0 , X1 , X2 , . . . to have the Markov prop- erty? Express your answer in plain English and in mathematical notation. [2 marks]
2. Suppose you have a method randexp(λ) which returns a random draw from the exponential distribution with rate parameter λ. Write a pseudo-code method randpoiss(λ) for sampling a single Poisson random variable with rate parameter λ. [2 marks]
Page 2 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
3. Suppose you have a method randunif() which returns a random draw from the uniform distri- bution on the interval [0, 1]. Write a pseudo-code method randwalk(L) that returns a simulated random walk of L steps starting at 0. Steps are of size 1 with steps to the left or right being equally probable. [2 marks]
4. Describe how you could use the randwalk method in Question 3 to estimate how long a walker on average spends before first stepping outside the interval [−10, 10]. [1 mark]
Page 3 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
5. Suppose we model regions of DNA sequences as being either being rich in purines or rich in pyrimidines. We use an HMM with two states, U for purine rich and Y for pyrimidine rich. A U state is followed by a Y state 5% of the time, and Y is followed by U 10% of the time. The process starts in any state with equal probability.
Each state can emit any of the 4 bases A, C, G or T . Emission probabilities are given by
ACGT U 0.3 0.2 0.3 0.2 Y 0.1 0.4 0.1 0.4
(a) Sketch a diagram of the HMM, showing all states, possible transitions and transition proba- bilities. Include a begin state. [2 marks]
(b) What is the joint probability P (x, π) of the state sequence π = U U and the symbol sequence x = GT ? (You can leave your answer as a product or sum of numbers). [2 marks]
Page 4 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
(c) Suppose that the emission and transition probabilities of the HMM discussed here were un- known but that a state sequence, π, and a symbol sequence, x, had been observed. Illustrate how the parameters aUU and eU(C) could be estimated from π = UUUY and x = ACAA and explain why pseudo-counts are used in this procedure. [2 marks]
(d) The Baum-Welch algorithm uses an iterative process to estimate the parameters of an HMM. Explain why may it be necessary to run the algorithm multiple times on the same problem. [1 mark]
Page 5 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
(e) For the HMM discussed here and a sequence x of length L, what is the running time of calculating P (x) via the forward algorithm? What is the running time of the naive method of directly calculating the sum P (x) = π P (x, π)? [1 mark]
(f) What are the inputs and outputs of the Viterbi algorithm? Calculate the value of z in the Viterbi matrix and show which row (0, U or Y) the traceback pointer from that cell should point to. You can leave your expression for z as a sum or product of numbers.
0AC 0100 U 0 0.15 z
Y 0 0.05 0.018
Page 6 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
(g) The posterior probability of being in state k at i is Pr(πi = k|x). Explain why the path πˆ where each πˆi is chosen to maximise Pr(πi = k|x) is not necessarily a valid state path. [1 mark]
Page 7 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
Section B: Sequence Analysis and Bayesian inference (28 marks)
6. Consider the following alignment Z.
What is the score of Z if each match scores 2, each mismatch scores -1, the gap open penalty is -4, and the gap extension penalty is -1? [1 mark]
7. Give two modifications that are necessary to turn the Needleman-Wunsch algorithm into the algorithm. Explain what each modification does. [1 mark]
Page 8 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
8. The partially completed F matrix for calculating the local alignment of the sequences AGC and GACCT is given below. The score matrix is given by s(a, b) = −2 when a ̸= b and s(a, a) = 4 while the gap penalty is d = −3.
000000 A004100 G041200 C012wxy
(a) What does the (i, j)th entry of F represent?
(b) Complete the matrix F by finding values for w, x and y.
(c) Give the score for the best local alignment of these two sequences and provide an alignment
that has this score.
Page 9 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
9. Consider a phylogeny representing the ancestral relationships between species that were all sam- pled at the same time.
(a) If the genetic distance to the root of the tree is not the same for all leaves, what can we infer about the substitution process giving rise to these data? [1 mark]
(b) Why might we prefer to use neighbour-joining over UPGMA to reconstruct a tree? [1 mark]
(c) WhatfeatureoftheevolutionarytreedoesUPGMAestimatethatneighbour-joiningdoesnot? [1 mark]
10. Consider the four aligned sequences, W,X,Y, and Z:
(a) Which sites in this alignment produce different scores on different trees? What are sites
having this property called?
Page 10 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
(b) By enumerating the three possible unrooted trees on the four taxa and finding the parsimony score for each tree, identify the maximum parsimony tree for this alignment. [3 marks]
(c) Give two criticisms of relying on the parsimony criteria to find the best tree. [1 mark]
(d) What is the time-complexity of the small parsimony problem (Fitch algorithm) in terms of number of leaves in the tree (n) and the number of sites in the alignment (L)? [1 mark]
(e) Briefly explain how the maximum parsimony tree can be found. Can this be done in polyno-
mial time?
Page 11 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
11. (a) Write down Bayes’ formula for the posterior probability of a random variable θ given data
(b) In your answer above, which term is the likelihood and which term is the prior?
(c) What does the prior represent?
(d) Which term can be ignored when using the Markov chain Monte Carlo algorithm and why? [1 mark]
Page 12 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
12. Consider the following time tree (right) and branch rates (centre; with indices into rate vector indicated on time tree):
(a) What branch lengths (in expected substitutions per site) on the tree on the left are implied by this time tree and branch rates? Draw and label the tree or list the branch lengths in the same order as for the rate vector. [2 marks]
(b) Draw an unrooted tree with branch lengths consistent with the time tree and branch rates
Page 13 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
(c) Write down the posterior density of a time tree g and branch rates μ⃗, given a coalescent tree prior p(g|N), a log-normal prior on the rates p(μ⃗|M,S) and a phylogenetic likelihood P (D|g, μ⃗ ), in the form of Bayes’ formula. [1 mark]
(d) In your formulated posterior, which of g, N, D, μ⃗, M, S are unknown random variables, and which quantities are considered known? [2 marks]
(e) What information from the genealogy g is required to compute the coalescent tree prior? [1 mark]
(f) What extra information would be required beyond the sequence alignment (D) if we wanted to consider a structured coalescent prior? [1 mark]
(g) What extra parameters could be estimated from a structured coalescent model compared to a
single-population model?
Page 14 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
Section C: Transcription, Genome Assembly and Artificial life (16 marks)
The diagram below shows transcription of a DNA gene. Label the diagram to indicate the following features: i) the 2 DNA strands, ii) the mRNA strand, iii) the RNA polymerase, and iv) the direction of transcription elongation (ie. the direction that the protein moves in to copy the sequence). [2 marks]
(b) Assuming that RNA polymerase copies the sequence perfectly, what will be the next nu-
cleotide added onto the mRNA?
Page 15 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
Consider the three reads GCGT, GTGC, TGGC. List the six unique 3-mers contained in these reads. [3 marks]
Construct a de Bruijn graph from these 3-mers. What do the nodes in your graph represent and what do the edges represent? [2 marks]
How many Eulerian paths are there in this graph? [1 mark]
Briefly describe each of Wolfram’s four classes of cellular automata, using 1-2 sentences
Page 16 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
(b) The figure below indicates a set of update rules for a one-dimensional cellular automaton.
What Wolfram-class does this CA belong to? Justify your answer in one or two sentences [1 mark]
(c) The figure below indicates another set of update rules for a one-dimensional cellular automa- ton
Using these update rules, fill in the grid below, showing how the initial condition indicated by row t=0 evolves over 5 iterations. [1 mark]
(d) Given the result that you produced above, what Wolfram-class does the CA in Question 15c likely belong to? Justify your answer in one or two sentences. [1 mark]
Page 17 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
Blank page — will not be marked
Page 18 of 19
QUESTION/ANSWER SHEET CompSci 369 ID:
Blank page — will not be marked
Page 19 of 19
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com