(584z) Robust In Silico Disease Classification Via Disease- and Procedure-Independent Optimization Models Using Quantitative MS1 Data From High-Throughput Proteomics

Guzman, Y. A., Princeton University
Floudas, C. A., Princeton University
Riley, C. P., Pathology Associates Medical Laboratories

A biomarker is a measurable characteristic that can indicate biological state relative to disease stage or contraction risk. Biomarkers have great potential to transform diagnostic medicine by serving as early indicators or predictors of developing or oncoming disease. Monitoring changes in the protein domain has the greatest potential to lead to the discovery of biomarkers because of the close relationship between protein expression and abundance and cellular state [1]. The need for rapid diagnostic turnaround in time-sensitive disease systems has yielded an explosion of biomarker discovery research utilizing high-throughput mass spectrometry proteomics protocols within the past decade.

Efforts to discover and define robust protein biomarkers have yielded disappointing results. From 1997 to 2006, about 224,000 biomarker papers were published, while only 15 biomarkers were approved for use by the FDA [2]. The initial stage of a biomarker discovery study typically involves a small number of samples, and what seems to be a distinguishing protein biomarker often yields a high number of false positives and false negatives at later validation stages with larger sample sizes and greater subject heterogeneity [1,3]. Focus has therefore shifted to utilizing panels of biomarkers to create a set of diagnostic rules.

The complexity of biological samples and fluids makes the application of typical high-throughput proteomic analysis exceedingly difficult. Blood and blood plasma remain the most enticing biofluid sources of protein biomarkers because of their clinical accessibility, but they display a dynamic concentration range spanning up to 11 orders of magnitude, with 99% of protein mass coming from 22 blood proteins [4-6]. Typical data-dependent acquisition for MS/MS fragmentation will exclude low-abundance proteins whose up- and down-regulation may capture disease response and treatment progression, and its stochastic nature reduces run-to-run reproducibility. These difficulties limit the depth of untargeted MS/MS protein identification protocols. In response, many biomarker discovery studies have focused on classification using MS1 features and have reported very high sensitivity and specificity. These studies have also elicited criticism [7-9], as statistical methods, machine learning techniques, and black-box models are prone to over-training and can magnify features that are actually data artifacts [9,10].

Building on a previous study in which mixed-integer linear optimization models were proposed to classify healthy and diseased samples [11,12], we propose a novel class of robust optimization models that can fingerprint and classify healthy and diseased samples given quantitative MS1 data. These models can simultaneously select the optimum subset of distinguishing MS1 peaks while performing parameter estimation. The resulting functions are of diagnostic utility, quantitatively classifying new blind samples given only MS1 data. The new classification models are general and independent of sample biofluid, experimental protocol, and disease system. The optimal peak subset yields a multiple reaction monitoring protein identification protocol for further sample characterization and biomarker investigation. Results from the proposed models are presented as applied to MS1 data of proteomics samples collected from different biofluids, subjected to different experimental protocols, and relating to vastly different disease systems, including plasma samples collected from breast cancer patients and gingival crevicular fluid samples collected from patients with chronic periodontitis [13,14].
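
As a rough illustration of how peak selection and parameter estimation can be coupled in one mixed-integer linear model, a generic formulation is sketched below. It is not the formulation proposed in this work: the symbols (x_ij for the intensity of MS1 peak j in sample i, c_i = +1 or -1 for diseased or healthy, weights w_j, offset b, binary selection variables y_j, slacks xi_i), the hinge-style misclassification penalty, the big-M constant M, and the peak budget K are all assumptions made for illustration, and the robust treatment of data uncertainty is omitted.

```latex
\begin{align*}
\min_{w,\,b,\,y,\,\xi} \quad & \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad
  & c_i \Bigl( \sum_{j=1}^{P} w_j x_{ij} + b \Bigr) \ge 1 - \xi_i, & i &= 1,\dots,N, \\
  & -M y_j \le w_j \le M y_j, & j &= 1,\dots,P, \\
  & \sum_{j=1}^{P} y_j \le K, \\
  & \xi_i \ge 0, \quad y_j \in \{0,1\}, \quad w_j,\, b \in \mathbb{R}.
\end{align*}
```

In this sketch, setting y_j = 0 forces w_j = 0, so the binary variables choose which peaks enter the classifier while the continuous variables w_j and b are estimated in the same solve. A new sample with intensities x_j would be classified by the sign of sum_j w_j x_j + b.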

[1] Rifai N., Gillette M.A., Carr S.A. Nature Biotechnology, 24(8):971-983, 2006.
[2] Jin G., Zhou X., Wang H., Wong S.T.C. The Challenges in Blood Proteomic Biomarker Discovery. In Pham T., Computational Biology: Issues and Applications in Oncology. New York: Springer, 2009.
[3] Srinivas P.R., Verma M., Zhao Y., Srivastava S. Clinical Chemistry, 48(8):1160-1169, 2002.
[4] Anderson N.L., Anderson N.G. Molecular & Cellular Proteomics, 1(11):845-867, 2002.
[5] Schiess R., Wollscheid B., Aebersold R. Molecular Oncology, 3(1):33-44, 2009.
[6] Veenstra T.D., Conrads T.P., Hood B.L., Avellino A.M., Ellenbogen R.G., Morrison R.S. Molecular & Cellular Proteomics, 4(4):409-418, 2005.
[7] Sorace J.M., Zhan M. BMC Bioinformatics, 4:24, 2003.
[8] Poste G. Nature, 469(7329):156-157, 2011.
[9] Rogers M.A., Clarke P., Noble J., Munro N.P., Paul A., Selby P.J., Banks R.E. Cancer Research, 63(20):6971-6983, 2003.
[10] He Z., Yu W. Computational Biology and Chemistry, 34(4):215-225, 2010.
[11] Baliban R.C., Sakellari D., Li Z., Guzman Y.A., Garcia B.A., Floudas C.A. Journal of Clinical Periodontology, 40(2):131-139, 2013.
[12] Baliban R.C., DiMaggio P.A., Plazas-Mayorca M.D., Garcia B.A., Floudas C.A. Journal of Proteome Research, 11(9):4615-4629, 2012.
[13] Riley C.P., Zhang X., Nakshatri H., Schneider B., Regnier F.E., Adamec J., Buck C. Journal of Translational Medicine, 9:80, 2011.
[14] Baliban R.C., Sakellari D., Li Z., DiMaggio P.A., Garcia B.A., Floudas C.A. Journal of Clinical Periodontology, 39(3):203-212, 2012.