Problem Set 5: The cycle of twelve
Sand mouse cells have a circadian oscillator that partly depends on transcriptional regulation. The mRNA expression
levels of twelve sand mouse genes are known to vary cyclically in a 24 hour period, in different phases. However, their
phases relative to each other remain unknown. The phase information is critical. If the 12 phases were known, then the 12
genes could be ordered into a relative order of activation. This would make testable predictions about which gene
products are most likely to be ‘upstream’ in the regulatory pathway and directly regulating transcription of which
‘downstream’ genes.
Available data
Of course, you could do RNA-seq experiments at different time points in entrained (synchronized) sand mouse cells, but
it would take time and money, and you and your lab mates are eager to leap ahead quickly to mechanism. By searching
the literature and the GEO database, you’ve found that there are already a few published RNA-seq experiments in
entrained sand mouse cells, at a small number of time points. You’ve collated these data into a table (pset5-data.tbl):
There are a total of 8 experiments, two each at the 4, 8, 16, and 24 hour time points, as marked in the
first line of bold column headers. The numbers in the table are estimated expression levels, in TPM, for the 12 genes.
https://en.wikipedia.org/wiki/Circadian_clock
https://www.ncbi.nlm.nih.gov/geo/
Figure 1. An example of a sand mouse gene being expressed in a circadian rhythm, with a period of 24 hours. The true
average oscillation in expression level is shown with the black line. The eight available experiments have measured this
gene’s expression at 4, 8, 16, and 24 hour time points. Two experiments are accurate with low variance (green dots); four
experiments are inaccurate with high variance (red dots); two experiments are intermediate (yellow dots).
The catch is that the experiments have different reliabilities, as indicated in the second line of bold column headers
marked ±20, ±2, and ±5. This variability is partly because different numbers of replicates were done, and partly
because some experiments were more careful than others. For the purposes of the exercise, we will assume that the
errors in TPM measurements are Gaussian-distributed with these known standard deviations σ. For example, the first
experiment is a 4 hr time point, with normal error N(0, σ = 20). There are two high-accuracy experiments (σ = 2), at the
4 hr and 24 hr points; four low-accuracy experiments (σ = 20); and two moderate-accuracy experiments (σ = 5).
You can assume that the underlying model for the cyclic expression level y_t of a gene at time t (in hours) is

    y_t = b + a sin(2πω(t + ϕ))

in terms of three parameters: a baseline mean expression level b (in TPM), an oscillation amplitude a (in TPM), and a
phase ϕ (an offset, in hours). You can treat the circadian frequency ω as known, ω = 1/24 cycles per hour. Note that the
term 2πω(t + ϕ) ends up in units of radians. (That's a hint: don't confuse hours and radians in implementing this exercise.)
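As a sketch in Python (the helper name and the `OMEGA` constant are my own choices, not given in the problem), the model is just:

```python
import numpy as np

OMEGA = 1.0 / 24.0  # known circadian frequency, in cycles per hour

def cyclic_expression(t, b, a, phi):
    """Mean expression level (TPM) at time t (hours).
    The argument 2*pi*OMEGA*(t + phi) is in radians: the factor
    2*pi*OMEGA converts hours to radians."""
    return b + a * np.sin(2.0 * np.pi * OMEGA * (t + phi))
```

For example, with b = 100, a = 30, and ϕ = 0, the expression level is 100 TPM at t = 0 and peaks at 130 TPM at t = 6 hr.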
The game is to deduce the unknown phases ϕ, given this small amount of published RNA-seq data (pset5-data.tbl)
of variable reliability, and thereby deduce the correct relative order of the 12 genes. (If mRNA1 has a phase +2 hrs
relative to mRNA2, that means mRNA1 starts to rise two hours before mRNA2 does, and thus is more likely to be a direct
regulator of mRNA2 transcription, rather than the other way around.)
Moriarty’s solution
Moriarty, the brilliant postdoc in the lab who’s a bit full of himself, says he’s seen this problem before — it’s harmonic
regression. He's familiar with a clever and widely used trick for it. The equation above for y_t is nonlinear in ϕ
(because of the sin), but Moriarty remembers his angle addition formulas from trigonometry, one of which is:

    sin(α + β) = sin α cos β + cos α sin β
If we apply that to the modeling equation for y_t, we get a linear regression function with three parameters:

    y_t = p0 + p1 sin(2πωt) + p2 cos(2πωt)

where those three parameters are:

    p0 = b
    p1 = a cos(2πωϕ)
    p2 = a sin(2πωϕ)

The function is nonlinear in t, but it's linear in the parameters, which is what matters for solving it by least squares.
Moriarty says that once we're in this form, we can just solve for the parameters with ordinary least squares. Given a
least squares fit for the parameters, we can use p2/p1 = tan(2πωϕ) and the arctan to solve for ϕ̂; then given ϕ̂, we can
get â from either p1 or p2, and b̂ is just p0.
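End to end, Moriarty's trick might look like this (my own sketch, with simulated noiseless data standing in for one gene's row of the table):

```python
import numpy as np

omega = 1.0 / 24.0
t = np.array([4., 4., 8., 8., 16., 16., 24., 24.])   # the 8 time points (hours)
y = 100 + 30 * np.sin(2*np.pi*omega*(t + 3.0))       # fake gene: b=100, a=30, phi=3

# design matrix for y = p0 + p1*sin(2*pi*omega*t) + p2*cos(2*pi*omega*t)
X = np.column_stack([np.ones_like(t),
                     np.sin(2*np.pi*omega*t),
                     np.cos(2*np.pi*omega*t)])
p0, p1, p2 = np.linalg.lstsq(X, y, rcond=None)[0]

b_hat   = p0
a_hat   = np.hypot(p1, p2)                       # a = sqrt(p1^2 + p2^2)
phi_hat = np.arctan2(p2, p1) / (2*np.pi*omega)   # arctan2 resolves the quadrant
phi_hat %= 24.0                                  # report the phase in [0, 24) hours
```

Note that ordinary least squares implicitly weights all 8 data points equally, which is exactly the assumption that the unequal σ's violate.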
Moriarty's script (moriarty.py) takes the data table as input, and does an ordinary least squares fit of each gene's
expression data to obtain best-fit b̂, â, and ϕ̂. Ordering the results by ϕ̂, Moriarty deduces that the pathway order
appears to be:
But even as Moriarty goes around the lab crowing about his solution, designing incisive new experiments based on his
brilliant deduction of the order (or so he says), you’re not quite convinced. The uneven reliability of the eight experiments
bothers you. You’re not sure the implicit assumptions of least squares fitting have been met.
1. Solve by maximum likelihood
Given the RNA-seq data (pset5-data.tbl), use maximum likelihood to solve for the unknown phases ϕ.
For each gene independently, you'll need to calculate a log likelihood log P(data ∣ θ), where the observed data are the
expression levels in the 8 experiments, and the parameters θ include the unknown parameters b, a, and ϕ you want to
find, plus the known parameters σ (and time points t) for each experiment.
Then you want to identify ML estimates b̂, â, and ϕ̂ for each gene, by identifying the parameter values that maximize
the log likelihood.
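A sketch of such an objective function (the names are my own; it returns the *negative* log likelihood, since SciPy's optimizer minimizes):

```python
import numpy as np

def neg_log_likelihood(params, t, y, sigma, omega=1.0/24.0):
    """-log P(data | b, a, phi), with independent Gaussian errors:
    y_i ~ N(b + a*sin(2*pi*omega*(t_i + phi)), sigma_i)."""
    b, a, phi = params
    mu = b + a * np.sin(2.0*np.pi*omega*(t + phi))
    # sum of Gaussian log pdfs over the 8 experiments, negated
    return np.sum(0.5*np.log(2.0*np.pi*sigma**2) + (y - mu)**2 / (2.0*sigma**2))
```

Because each experiment carries its own σ_i, accurate experiments pull the fit harder than noisy ones, which is the weighting that ordinary least squares lacks.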
You could discretize (as we’ve done before) and enumerate a grid search over a range of parameter values, but with three
parameters in play, that’d get to be tedious. Instead, this week it’s time to learn a new trick: solve this one using
multidimensional optimization using SciPy’s optimize.minimize() function.
The interface to calling any general library's optimization routine (SciPy or otherwise) is always finicky. You give it your
objective function f(p̄) (here, your negative log likelihood, since minimize() minimizes) in a general form, so SciPy can
call it blindly and doesn't need to worry about what anything means. All it needs to do is start from an initial guess at a
parameter vector p̄ (provided by you) and move around in parameter space, calling the objective function you provided
every time it moves to a new point, until it finds a point such that your objective function reaches an optimum. There's a
ton of art in making optimizers work properly, but since that's been done for you in SciPy, essentially all the art (from
your point of view) is in figuring out the interface to optimize.minimize(). Documentation for it is linked at the end, and
see the hints there for an example.
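A minimal sketch of the calling convention, using simulated data (the variable names, the hypothetical σ ordering, and the multi-start over phase are my own choices):

```python
import numpy as np
from scipy.optimize import minimize

omega = 1.0 / 24.0
t     = np.array([4., 4., 8., 8., 16., 16., 24., 24.])
sigma = np.array([20., 2., 20., 20., 20., 5., 5., 2.])  # hypothetical order; read yours from the table
rng   = np.random.default_rng(0)
y     = 100 + 30*np.sin(2*np.pi*omega*(t + 3.0)) + rng.normal(0.0, sigma)

def nll(p):
    """Objective for minimize(): negative log likelihood at parameter vector p."""
    b, a, phi = p
    mu = b + a*np.sin(2*np.pi*omega*(t + phi))
    return np.sum(0.5*np.log(2*np.pi*sigma**2) + (y - mu)**2 / (2*sigma**2))

# the likelihood is multimodal in phi, so try several starting phases
results = [minimize(nll, x0=[y.mean(), y.std(), phi0], method="Nelder-Mead")
           for phi0 in (0.0, 6.0, 12.0, 18.0)]
best = min(results, key=lambda r: r.fun)
b_hat, a_hat, phi_hat = best.x
```

The multi-start is one simple way to guard against local optima; another is a coarse grid over ϕ followed by a single polish with minimize().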
2. Compare solutions
Moriarty is incensed that you have a different solution. He offers to bet you on who’s right. Compute the total log
likelihood (summed over all 12 genes) for Moriarty’s solution and yours. Which is more likely, and by how much?
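The comparison itself is just a sum of per-gene log likelihoods, each evaluated at that solution's fitted parameters (a sketch; this is the positive-sign version of the objective, and the loop is pseudocode to fill in with your data):

```python
import numpy as np

def log_likelihood(params, t, y, sigma, omega=1.0/24.0):
    """log P(data | b, a, phi) for one gene under the Gaussian error model."""
    b, a, phi = params
    mu = b + a*np.sin(2*np.pi*omega*(t + phi))
    return np.sum(-0.5*np.log(2*np.pi*sigma**2) - (y - mu)**2 / (2*sigma**2))

# totals, summed over the 12 genes (fill in each gene's data and fitted params):
#   total_ml  = sum(log_likelihood(ml_fit[g],  t, Y[g], sigma) for g in range(12))
#   total_ols = sum(log_likelihood(ols_fit[g], t, Y[g], sigma) for g in range(12))
# a difference of D log units means one solution is e**D times more likely
```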
3. Plot the fits
For each of the 12 genes, plot the observed data and the two fits (so you can show Moriarty how he went wrong).
(You can use something like a 3×4 multipanel figure in matplotlib.)
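One way to lay out the multipanel figure (a sketch with placeholder sinusoids; swap in each gene's observed points and the two fitted curves):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # file output only; no display needed
import matplotlib.pyplot as plt

omega  = 1.0/24.0
t_obs  = np.array([4., 4., 8., 8., 16., 16., 24., 24.])
t_fine = np.linspace(0.0, 24.0, 200)

fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharex=True)
for g, ax in enumerate(axes.flat):                          # one panel per gene
    y_obs = 100 + 30*np.sin(2*np.pi*omega*(t_obs + g))      # placeholder "data"
    ax.plot(t_obs, y_obs, "ko", label="observed")
    ax.plot(t_fine, 100 + 30*np.sin(2*np.pi*omega*(t_fine + g)), label="ML fit")
    ax.plot(t_fine, 100 + 25*np.sin(2*np.pi*omega*(t_fine + g + 1)), "--", label="OLS fit")
    ax.set_title(f"gene {g+1}")
axes[0, 0].legend(fontsize=8)
fig.tight_layout()
fig.savefig("pset5-fits.png")
```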
optimize_example.py is a heavily commented example of using scipy.optimize.minimize() on a simple
maximum likelihood optimization.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize