Risk prediction using whole genome data.
Accurate disease risk prediction is an essential step towards early disease detection and personalized medicine. Biotechnological advances have generated massive amounts of high-dimensional genomic data, which provide valuable resources for risk prediction research. However, the high dimensionality of whole-genome data poses great analytical challenges for prediction modelling. The traditional paradigm of using ‘top genetic variants’ selected on the basis of p-values is unlikely to lead to an accurate prediction model, because the vast majority of small-effect genetic variants in whole-genome sequencing data fail to meet the pre-selection criteria and are overlooked. Penalized regression techniques, in turn, become computationally expensive when applied to whole-genome sequencing data, where millions of potential predictors are measured.
A fundamental concept in quantitative genetics is that of linking genetic profiles to disease outcomes through genetic similarity among individuals, and this concept has been exploited within linear mixed effect models using whole-genome data. By transforming millions of genetic variants into a subject-similarity measure, these methods reduce the data dimension substantially. However, the similarity derived from all genetic variants, which carries a large amount of noise, may not be sufficient for building an accurate risk prediction model. Moreover, most of these methods are designed for normally distributed outcomes, and it is not trivial to extend them to outcomes with other distributions (e.g. binary). A commonly used approach in genetic research is to treat all outcomes as if they were normally distributed. Though often successful in practice, its reliance on a misspecified probabilistic model is expected to result in suboptimal performance.
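To make the dimension-reduction idea concrete, below is a minimal sketch (Python/NumPy) of how a genetic relationship matrix (GRM) collapses p variants into an n x n similarity matrix, and how gBLUP-style predictions follow from it. The genotypes are simulated, and the variance ratio lambda is treated as known rather than estimated by REML; the sample sizes, heritability, and lambda value are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: n subjects x p variants, coded 0/1/2 (simulated here;
# in practice these would come from whole-genome sequencing data).
n, p = 500, 5000
maf = rng.uniform(0.05, 0.5, size=p)
G = rng.binomial(2, maf, size=(n, p)).astype(float)

# Genetic relationship matrix: standardize each variant, then K = ZZ'/p.
Z = (G - G.mean(axis=0)) / G.std(axis=0)
K = Z @ Z.T / p

def gblup_predict(K, y_train, train_idx, test_idx, lam=1.0):
    """gBLUP prediction with the variance ratio lam = sigma_e^2 / sigma_g^2
    treated as known (in practice it is estimated by REML)."""
    K_tt = K[np.ix_(train_idx, train_idx)]
    K_st = K[np.ix_(test_idx, train_idx)]
    mu = y_train.mean()
    alpha = np.linalg.solve(K_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + K_st @ alpha

# Simulate a polygenic trait (h2 ~ 0.5, hence lam = 1) and check the idea.
beta = rng.normal(0, np.sqrt(1 / p), size=p)
y = Z @ beta + rng.normal(0, 1, size=n)
train_idx, test_idx = np.arange(400), np.arange(400, n)
y_hat = gblup_predict(K, y[train_idx], train_idx, test_idx)
print(np.corrcoef(y_hat, y[test_idx])[0, 1])
```

Note that the prediction step only involves the n x n matrix K, not the original n x p genotype matrix, which is the computational appeal of this approach.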
In this project, we are interested in evaluating the performance of gBLUP (or rrBLUP) for outcomes from the exponential family of distributions through simulation studies and a real data application (possibly from ADNI). We will mainly focus on non-Gaussian outcomes. We will compare the common practice in genetics (i.e. treating the outcomes as if they were normally distributed) with the ‘right model’, in which the likelihood function is correctly specified according to the underlying distribution. We will use mean squared error/Pearson correlation and area under the ROC curve to evaluate performance for continuous and binary outcomes, respectively. The ultimate goal of this study is to provide practical recommendations for risk prediction modelling using mixed effect models.
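For the evaluation criteria, a small sketch of the two metric helpers is given below (the function names are my own; the metrics themselves come from SciPy and scikit-learn):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def evaluate_continuous(y_true, y_pred):
    """Mean squared error and Pearson correlation for continuous outcomes."""
    mse = np.mean((y_true - y_pred) ** 2)
    r, _ = pearsonr(y_true, y_pred)
    return {"mse": mse, "pearson_r": r}

def evaluate_binary(y_true, y_score):
    """Area under the ROC curve for binary outcomes; y_score can be any
    risk score, e.g. gBLUP predictions from a Gaussian working model."""
    return {"auc": roc_auc_score(y_true, y_score)}
```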
In terms of the simulation plan, here is what I think we should do (a rough data-generation sketch follows the list):
1) Generate normally distributed outcomes using additive genetic models and also models with interaction terms.
2) Generate binary outcomes according to an additive genetic model with various effect sizes.
3) Generate Poisson-distributed outcomes according to additive genetic models with various effect sizes.
4) Generate exponentially distributed outcomes according to additive genetic models with various effect sizes.
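As a starting point for items 1)–4), here is a hedged sketch of a single data-generating function; Z is the standardized genotype matrix from the GRM sketch above, and the heritability parameter, interaction scheme, and link functions are illustrative assumptions of mine rather than a fixed design.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_outcomes(Z, h2=0.5, family="gaussian", n_interactions=0):
    """Simulate outcomes from an additive genetic model eta = Z @ beta,
    optionally adding pairwise interaction terms, for several families.
    The effect-size scaling (h2) and link functions are illustrative."""
    n, p = Z.shape
    beta = rng.normal(0, np.sqrt(h2 / p), size=p)
    eta = Z @ beta
    # Optional pairwise interactions between randomly chosen variants.
    for _ in range(n_interactions):
        j, k = rng.choice(p, size=2, replace=False)
        eta += rng.normal(0, 0.1) * Z[:, j] * Z[:, k]
    if family == "gaussian":
        return eta + rng.normal(0, np.sqrt(1 - h2), size=n)
    if family == "binary":        # logit link
        return rng.binomial(1, 1 / (1 + np.exp(-eta)))
    if family == "poisson":       # log link
        return rng.poisson(np.exp(eta))
    if family == "exponential":   # log link for the mean
        return rng.exponential(np.exp(eta))
    raise ValueError(f"unknown family: {family}")
```

Varying h2 gives the "various effect sizes" in items 2)–4), while n_interactions > 0 covers the interaction setting in item 1).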
In terms of the real data evaluation, here is what I think (a rough cross-validation sketch follows the list):
1) Apply gBLUP/rrBLUP to one roughly normally distributed outcome from the ADNI study (possibly FDG, Fusion, or Learning).
2) Apply gBLUP/rrBLUP to a binary outcome (AD diagnosis from ADNI).
3) Apply gBLUP/rrBLUP to a skewed outcome (some brain measure from ADNI).
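For the real-data runs, a possible cross-validation wrapper is sketched below. It uses scikit-learn's KernelRidge on a precomputed GRM as a stand-in for gBLUP with a fixed variance ratio, which corresponds to the "treat everything as Gaussian" practice we want to benchmark; the ADNI phenotypes themselves are not shown, and the function name, alpha value, and fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def cv_gblup(K, y, binary=False, alpha=1.0, n_splits=5, seed=0):
    """K-fold cross-validated gBLUP-style prediction: kernel ridge regression
    on a precomputed GRM (equivalent to gBLUP with a fixed variance ratio).
    Returns AUC for binary outcomes and Pearson correlation otherwise."""
    preds = np.empty_like(y, dtype=float)
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(y):
        # Centre the outcome within each training fold, since kernel ridge
        # has no intercept term.
        y_tr_mean = y[tr].mean()
        model = KernelRidge(alpha=alpha, kernel="precomputed")
        model.fit(K[np.ix_(tr, tr)], y[tr] - y_tr_mean)
        preds[te] = model.predict(K[np.ix_(te, tr)]) + y_tr_mean
    if binary:
        return roc_auc_score(y, preds)
    return np.corrcoef(y, preds)[0, 1]
```

The same wrapper could be run unchanged on the continuous, binary, and skewed ADNI outcomes, with the correctly specified (non-Gaussian) models fitted separately for comparison.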