Can we take a mathematical biology approach to high-throughput data?
Classical mathematical biology and high-throughput bioinformatics share the goal of understanding the mechanisms of development and disease, health and aging. Their methods, however, could hardly be more distinct. Efforts in mathematical biology typically involve detailed study and modeling of a small, well-defined biological subsystem. Modeling can be driven by qualitative or quantitative data (sometimes of very high quality), but it is typically limited to a small number of entities (e.g. genes). In contrast, high-throughput bioinformatics studies system-wide measurements (e.g. of gene expression, DNA sequence or state, or molecular interactions). However, the data are noisy, and no individual finding can be trusted without further validation. Part of the problem is the classical "n versus p" dilemma of bioinformatics: we are typically concerned with thousands or tens of thousands of entities, while the number of samples may range from a handful to a few tens or hundreds. However, the recent grand-scale efforts of international consortia such as ENCODE, TCGA, and IHEC are producing so much data that sample numbers are finally approaching, or even vastly exceeding, entity numbers. This raises the hope that noise can be averaged away and accurate answers can be obtained from high-throughput data. Yet analysis of these data has largely relied on older approaches designed for smaller sample sizes. I will describe several recent studies from my lab showing that established bioinformatic approaches for analyzing, for example, high-throughput RNA-seq or ChIP-seq data do not scale well to ultra-large datasets: not in the sense that they are computationally too slow, but in the sense that they do not extract as much information as we believe they should when offered a much larger number of samples. The problem, as we see it, is that, in contrast to classical mathematical biology, these algorithms treat every entity identically. Instead, we propose a new mantra for the creation of high-throughput bioinformatics algorithms for ultra-large studies: the algorithms must be general, but each entity must be treated as unique. In that context, I will describe several new approaches we have been developing for extending RNA-seq and ChIP-seq analyses in ways that take advantage of ultra-large datasets, that make predictions conditional on the specific features of each entity, and that, we hope, offer a way towards more powerful and statistically robust extraction of biological information from these incredible resources.
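
As a loose illustration of the trade-off the abstract alludes to (this sketch is not part of the original abstract, and it is not the lab's actual method), the following toy Python simulation contrasts a pooled, one-size-fits-all variability estimate with per-entity estimates. Every parameter value here is invented for illustration: with only a handful of samples, per-entity estimates are noisier than the pooled one, but as n grows they keep improving while the pooled estimate cannot.

    # Toy sketch (hypothetical, invented values): per-entity vs pooled estimation
    # as the sample count n grows. Each "gene" has its own true variability.
    import numpy as np

    rng = np.random.default_rng(0)
    p = 2000                                   # entities (genes)
    true_sd = rng.uniform(0.5, 3.0, size=p)    # entity-specific ground truth

    def mean_abs_error(n):
        """Compare per-gene and pooled SD estimates at sample size n."""
        data = rng.normal(0.0, true_sd[:, None], size=(p, n))
        per_gene = data.std(axis=1, ddof=1)    # each entity treated as unique
        pooled = np.full(p, per_gene.mean())   # every entity treated identically
        return np.abs(per_gene - true_sd).mean(), np.abs(pooled - true_sd).mean()

    for n in (3, 10, 100, 3000):
        pg, pl = mean_abs_error(n)
        print(f"n={n:>4}: per-gene error={pg:.3f}  pooled error={pl:.3f}")

In this caricature, the pooled estimator stands in for the "every entity treated identically" strategy: it is the safer bet when n is tiny, but its error stays flat no matter how many samples arrive, whereas the per-entity estimator keeps improving as sample numbers approach and exceed entity numbers.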