Problem Set 5: The cycle of twelve
Sand mouse cells have a circadian oscillator that partly depends on transcriptional regulation. The mRNA expression
levels of twelve sand mouse genes are known to vary cyclically in a 24 hour period, in different phases. However, their
phases relative to each other remain unknown. The phase information is critical. If the 12 phases were known, then the 12
genes could be ordered into a relative order of activation. This would make testable predictions about which gene
products are most likely to be ‘upstream’ in the regulatory pathway and directly regulating transcription of which
‘downstream’ genes.
Available data
Of course, you could do RNA-seq experiments at different time points in entrained (synchronized) sand mouse cells, but
it would take time and money, and you and your lab mates are eager to leap ahead quickly to mechanism. By searching
the literature and the GEO database, you’ve found that there are already a few published RNA-seq experiments in
entrained sand mouse cells, at a small number of time points. You’ve collated these data into a table (pset5-data.tbl):
There are a total of 8 experiments, two each at the 4, 8, 16, and 24 hour time points, as marked in the
first line of bold column headers. The numbers in the table are estimated expression levels, in TPM, for the 12 genes.
https://en.wikipedia.org/wiki/Circadian_clock
https://www.ncbi.nlm.nih.gov/geo/
Figure 1. An example of a sand mouse gene being expressed in a circadian rhythm, with a period of 24 hours. The true
average oscillation in expression level is shown with the black line. The eight available experiments have measured this
gene’s expression at 4, 8, 16, and 24 hour time points. Two experiments are accurate with low variance (green dots); four
experiments are inaccurate with high variance (red dots); two experiments are intermediate (yellow dots).
The catch is that the experiments have different reliabilities, as indicated in the second line of bold column headers
marked ±20, ±2, and ±5. This variability is partly because different numbers of replicates were done, and partly
because some experiments were more careful than others. For the purposes of the exercise, we will assume that the
errors in TPM measurements are Gaussian-distributed with these known standard deviations σ. For example, the first
experiment is a 4 hr time point, with normal error N(0, σ = 20). There are two high-accuracy experiments (σ = 2), at the
4 hr and 24 hr points; four low-accuracy experiments (σ = 20); and two moderate-accuracy experiments (σ = 5).
You can assume that the underlying model for the cyclic expression level y_t of a gene at time t (in hours) is

    y_t = b + a sin(2πω(t + ϕ))

in terms of three parameters: a baseline mean expression level b (in TPM), an oscillation amplitude a (in TPM), and a
phase ϕ (an offset, in hours). You can treat the circadian frequency ω as known, ω = 1/24 cycles per hour. Note that the
term 2πω(t + ϕ) ends up in units of radians. (That's a hint: don't confuse hours and radians in implementing this exercise.)
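As a sketch in Python (the helper name and the `OMEGA` constant are my own choices, not given in the problem), the model is just:

```python
import numpy as np

OMEGA = 1.0 / 24.0  # known circadian frequency, in cycles per hour

def cyclic_expression(t, b, a, phi):
    """Mean expression level (TPM) at time t (hours).
    The argument 2*pi*OMEGA*(t + phi) is in radians: the factor
    2*pi*OMEGA converts hours to radians."""
    return b + a * np.sin(2.0 * np.pi * OMEGA * (t + phi))
```

For example, with b = 100, a = 30, and ϕ = 0, the expression level is 100 TPM at t = 0 and peaks at 130 TPM at t = 6 hr.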
The game is to deduce the unknown phases ϕ, given this small amount of published RNA-seq data (pset5-data.tbl)
of variable reliability, and thereby deduce the correct relative order of the 12 genes. (If mRNA1 has a phase +2 hrs
relative to mRNA2, that means mRNA1 starts to rise two hours before mRNA2 does, and thus is more likely to be a direct
regulator of mRNA2 transcription, rather than the other way around.)
Moriarty’s solution
Moriarty, the brilliant postdoc in the lab who’s a bit full of himself, says he’s seen this problem before — it’s harmonic
regression. He's familiar with a clever and widely used trick for it. The equation above for y_t is nonlinear in ϕ
(because of the sin), but Moriarty remembers his angle addition formulas from trigonometry, one of which is:

    sin(α + β) = sin α cos β + cos α sin β
If we apply that to the modeling equation for y_t, we get a linear regression function with three parameters:

    y_t = p0 + p1 sin(2πωt) + p2 cos(2πωt)

where those three parameters are:

    p0 = b
    p1 = a cos(2πωϕ)
    p2 = a sin(2πωϕ)

The function is nonlinear in t, but it's linear in the parameters, which is what matters for solving it by least squares.
Moriarty says that once we're in this form, we can just solve for the parameters with ordinary least squares. Given a
least squares fit for the parameters, we can use p2/p1 = tan(2πωϕ) and the arctan to solve for ϕ̂; then given ϕ̂, we can
get â from either p1 or p2, and b̂ is just p0.
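End to end, Moriarty's trick might look like this (my own sketch, with simulated noiseless data standing in for one gene's row of the table):

```python
import numpy as np

omega = 1.0 / 24.0
t = np.array([4., 4., 8., 8., 16., 16., 24., 24.])   # the 8 time points (hours)
y = 100 + 30 * np.sin(2*np.pi*omega*(t + 3.0))       # fake gene: b=100, a=30, phi=3

# design matrix for y = p0 + p1*sin(2*pi*omega*t) + p2*cos(2*pi*omega*t)
X = np.column_stack([np.ones_like(t),
                     np.sin(2*np.pi*omega*t),
                     np.cos(2*np.pi*omega*t)])
p0, p1, p2 = np.linalg.lstsq(X, y, rcond=None)[0]

b_hat   = p0
a_hat   = np.hypot(p1, p2)                       # a = sqrt(p1^2 + p2^2)
phi_hat = np.arctan2(p2, p1) / (2*np.pi*omega)   # arctan2 resolves the quadrant
phi_hat %= 24.0                                  # report the phase in [0, 24) hours
```

Note that ordinary least squares implicitly weights all 8 data points equally, which is exactly the assumption that the unequal σ's violate.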
Moriarty's script (moriarty.py) takes the data table as input, and does an ordinary least squares fit of each gene's
expression data to obtain best-fit b̂, â, and ϕ̂. Ordering the results by ϕ̂, Moriarty deduces that the pathway order
appears to be:
But even as Moriarty goes around the lab crowing about his solution, designing incisive new experiments based on his
brilliant deduction of the order (or so he says), you’re not quite convinced. The uneven reliability of the eight experiments
bothers you. You’re not sure the implicit assumptions of least squares fitting have been met.
1. Solve by maximum likelihood
Given the RNA-seq data (pset5-data.tbl), use maximum likelihood to solve for the unknown phases ϕ.
For each gene independently, you'll need to calculate a log likelihood log P(data ∣ θ), where the observed data are the
expression levels in the 8 experiments, and the parameters θ include the unknown parameters b, a, and ϕ you want to
find, plus the known parameters σ (and time points t) for each experiment.
Then you want to identify ML estimates b̂, â, and ϕ̂ for each gene, by identifying the parameter values that maximize
the log likelihood.
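A sketch of such an objective function (the names are my own; it returns the *negative* log likelihood, since SciPy's optimizer minimizes):

```python
import numpy as np

def neg_log_likelihood(params, t, y, sigma, omega=1.0/24.0):
    """-log P(data | b, a, phi), with independent Gaussian errors:
    y_i ~ N(b + a*sin(2*pi*omega*(t_i + phi)), sigma_i)."""
    b, a, phi = params
    mu = b + a * np.sin(2.0*np.pi*omega*(t + phi))
    # sum of Gaussian log pdfs over the 8 experiments, negated
    return np.sum(0.5*np.log(2.0*np.pi*sigma**2) + (y - mu)**2 / (2.0*sigma**2))
```

Because each experiment carries its own σ_i, accurate experiments pull the fit harder than noisy ones, which is the weighting that ordinary least squares lacks.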
You could discretize (as we’ve done before) and enumerate a grid search over a range of parameter values, but with three
parameters in play, that’d get to be tedious. Instead, this week it’s time to learn a new trick: solve this one using
multidimensional optimization using SciPy’s optimize.minimize() function.
The interface to calling any general library's optimization routine (SciPy or otherwise) is always finicky. You give it your
objective function f(p̄) (here, your negative log likelihood, since minimize() minimizes) in a general form, so SciPy can
call it blindly and doesn't need to worry about what anything means. All it needs to do is start from an initial guess at a
parameter vector p̄ (provided by you) and move around in parameter space, calling the objective function you provided
every time it moves to a new point, until it finds a point such that your objective function reaches an optimum. There's a
ton of art in making optimizers work properly, but since that's been done for you in SciPy, essentially all the art (from
your point of view) is in figuring out the interface to optimize.minimize(). Documentation for it is linked at the end, and
see the hints there for an example.
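A minimal sketch of the calling convention, using simulated data (the variable names, the hypothetical σ ordering, and the multi-start over phase are my own choices):

```python
import numpy as np
from scipy.optimize import minimize

omega = 1.0 / 24.0
t     = np.array([4., 4., 8., 8., 16., 16., 24., 24.])
sigma = np.array([20., 2., 20., 20., 20., 5., 5., 2.])  # hypothetical order; read yours from the table
rng   = np.random.default_rng(0)
y     = 100 + 30*np.sin(2*np.pi*omega*(t + 3.0)) + rng.normal(0.0, sigma)

def nll(p):
    """Objective for minimize(): negative log likelihood at parameter vector p."""
    b, a, phi = p
    mu = b + a*np.sin(2*np.pi*omega*(t + phi))
    return np.sum(0.5*np.log(2*np.pi*sigma**2) + (y - mu)**2 / (2*sigma**2))

# the likelihood is multimodal in phi, so try several starting phases
results = [minimize(nll, x0=[y.mean(), y.std(), phi0], method="Nelder-Mead")
           for phi0 in (0.0, 6.0, 12.0, 18.0)]
best = min(results, key=lambda r: r.fun)
b_hat, a_hat, phi_hat = best.x
```

The multi-start is one simple way to guard against local optima; another is a coarse grid over ϕ followed by a single polish with minimize().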
2. Compare solutions
Moriarty is incensed that you have a different solution. He offers to bet you on who’s right. Compute the total log
likelihood (summed over all 12 genes) for Moriarty’s solution and yours. Which is more likely, and by how much?
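The comparison itself is just a sum of per-gene log likelihoods, each evaluated at that solution's fitted parameters (a sketch; this is the positive-sign version of the objective, and the loop is pseudocode to fill in with your data):

```python
import numpy as np

def log_likelihood(params, t, y, sigma, omega=1.0/24.0):
    """log P(data | b, a, phi) for one gene under the Gaussian error model."""
    b, a, phi = params
    mu = b + a*np.sin(2*np.pi*omega*(t + phi))
    return np.sum(-0.5*np.log(2*np.pi*sigma**2) - (y - mu)**2 / (2*sigma**2))

# totals, summed over the 12 genes (fill in each gene's data and fitted params):
#   total_ml  = sum(log_likelihood(ml_fit[g],  t, Y[g], sigma) for g in range(12))
#   total_ols = sum(log_likelihood(ols_fit[g], t, Y[g], sigma) for g in range(12))
# a difference of D log units means one solution is e**D times more likely
```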
3. Plot the fits
For each of the 12 genes, plot the observed data and the two fits (so you can show Moriarty how he went wrong).
(You can use something like a 3×4 multipanel figure in matplotlib.)
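One way to lay out the multipanel figure (a sketch with placeholder sinusoids; swap in each gene's observed points and the two fitted curves):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # file output only; no display needed
import matplotlib.pyplot as plt

omega  = 1.0/24.0
t_obs  = np.array([4., 4., 8., 8., 16., 16., 24., 24.])
t_fine = np.linspace(0.0, 24.0, 200)

fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharex=True)
for g, ax in enumerate(axes.flat):                          # one panel per gene
    y_obs = 100 + 30*np.sin(2*np.pi*omega*(t_obs + g))      # placeholder "data"
    ax.plot(t_obs, y_obs, "ko", label="observed")
    ax.plot(t_fine, 100 + 30*np.sin(2*np.pi*omega*(t_fine + g)), label="ML fit")
    ax.plot(t_fine, 100 + 25*np.sin(2*np.pi*omega*(t_fine + g + 1)), "--", label="OLS fit")
    ax.set_title(f"gene {g+1}")
axes[0, 0].legend(fontsize=8)
fig.tight_layout()
fig.savefig("pset5-fits.png")
```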
optimize_example.py is a heavily commented example of using scipy.optimize.minimize() on a simple
maximum likelihood optimization.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize