(740c) A Multi-Omic Classification Strategy for Incorporating Incomplete Datasets | AIChE

Advancements in high-throughput technologies have led to an increase in multi-omic studies of human diseases. Online databases, including The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), provide platforms for researchers to access these datasets to validate novel bioinformatics and machine learning tools, uncover new disease markers, or identify disease mechanisms across molecular data types. However, due to sample quality or other experimental challenges, some fraction of samples is often unavailable for one or more 'omics data types. Many published multi-omic biomarker identification and classification algorithms require complete data for all 'omics data types. Particularly for small datasets, or datasets that require careful matching to controls for confounding effects, excluding samples for which at least one molecular data type is missing can result in a loss of statistical power and poor classification performance. Here we present a decision tree-based multi-omic classification algorithm that incorporates samples for which one or more molecular data types are missing.

We constructed a multi-omics tree-based classification strategy to evaluate the effects of incorporating missing data, along with multiple data integration strategies. We evaluated the performance of our multi-omic classifier in multiple TCGA datasets that contained samples with missing data for a single data type. We compared each classifier's ability to predict survival, treatment response and/or disease subtype, based on available clinical data. We compared the cross-validated classification performance of an identical classifier applied to both the full dataset (containing missing data) and the partial dataset containing only samples for which all data is available. Using this strategy, classification performance improves with incorporation of incomplete samples. Additionally, we systematically evaluated the effect of the fraction of incomplete data by simulating missing data from each dataset. We used these simulations to determine the threshold of missing data that can be tolerated without loss of performance. Based on results from multiple cancer datasets, the proposed multi-omic classification strategy provides an efficient method for preserving statistical power in multi-omic biomarker studies with incomplete data.
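
The abstract does not specify how the classifier handles a missing data type, so the following Python sketch is only an illustration of the comparison described above, not the authors' algorithm. It assumes a toy cohort with two hypothetical data types (`expression` and `methylation`), one depth-1 tree (stump) per data type, and a vote taken only over the data types observed for a given sample. All function names and the synthetic data are assumptions made for illustration, and a single held-out set stands in for cross-validation.

```python
import random
from statistics import mean

BLOCKS = ["expression", "methylation"]  # hypothetical data types


def simulate_samples(n, rng):
    """Toy cohort: one informative feature per data type (synthetic, not TCGA)."""
    samples = []
    for i in range(n):
        label = i % 2  # balanced classes
        samples.append({
            "label": label,
            "expression": rng.gauss(1.5 * label, 1.0),
            "methylation": rng.gauss(1.0 * label, 1.0),
        })
    return samples


def mask_block(samples, block, fraction, rng):
    """Simulate missing data: set one data type to None for a fraction of samples."""
    masked = [dict(s) for s in samples]
    n_missing = round(fraction * len(masked))
    for idx in rng.sample(range(len(masked)), n_missing):
        masked[idx][block] = None
    return masked


def complete_cases(samples):
    """Partial dataset: keep only samples with every data type observed."""
    return [s for s in samples if all(s[b] is not None for b in BLOCKS)]


def fit_stump(samples, block):
    """Depth-1 tree on one data type, fitted on the samples where it is observed.

    The split threshold is the midpoint of the two class means (a simplification).
    """
    observed = [s for s in samples if s[block] is not None]
    m0 = mean(s[block] for s in observed if s["label"] == 0)
    m1 = mean(s[block] for s in observed if s["label"] == 1)
    return {"block": block, "threshold": (m0 + m1) / 2,
            "high_class": 1 if m1 >= m0 else 0}


def predict(stumps, sample):
    """Vote over the stumps whose data type is observed; ties go to class 1."""
    votes = []
    for stump in stumps:
        value = sample[stump["block"]]
        if value is None:
            continue  # skip data types missing for this sample
        high = value > stump["threshold"]
        votes.append(stump["high_class"] if high else 1 - stump["high_class"])
    if not votes:
        return None  # no data type observed at all
    return 1 if mean(votes) >= 0.5 else 0


def accuracy(stumps, samples):
    """Fraction of correct predictions among samples that received a prediction."""
    scored = [(predict(stumps, s), s["label"]) for s in samples]
    return mean(1.0 if p == y else 0.0 for p, y in scored if p is not None)


rng = random.Random(0)
train = mask_block(simulate_samples(200, rng), "methylation", 0.3, rng)
held_out = simulate_samples(200, rng)
partial = complete_cases(train)

full_model = [fit_stump(train, b) for b in BLOCKS]       # uses incomplete samples
partial_model = [fit_stump(partial, b) for b in BLOCKS]  # complete cases only

print("training samples (full / partial):", len(train), "/", len(partial))
print("held-out accuracy, full dataset:   ", accuracy(full_model, held_out))
print("held-out accuracy, partial dataset:", accuracy(partial_model, held_out))
```

Repeating the run over a range of `fraction` values in `mask_block` mirrors the simulation described above, in which the tolerated threshold of missing data is estimated from the point where performance begins to degrade.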
