(388e) Novel Biclustering Framework for Tertiary Contact Prediction Using Low-Homology Protein Templates

Smadbeck, J., Princeton University
Kieslich, C. A., Texas A&M University
Khoury, G. A., Pennsylvania State University-University Park
Floudas, C. A., Princeton University

The prediction of the three-dimensional structure of a protein from its amino acid sequence remains an open problem in molecular biology, with important implications for biological engineering. For sequences that have only low-homology template structures, ranking the templates and generating predicted structures can be particularly difficult. Accurate prediction of long-range amino acid contacts can help in ranking these low-homology templates for modeling. Additionally, these contacts can be useful in constraining dynamic simulations of the structure of low-homology targets. To this end, we have developed a novel tertiary contact prediction method based on biclustering analysis for extracting Cα distance constraints from low-homology templates.

The initial step of our procedure is the identification of template structures for the target sequence using a modified version of the SPARKS-X threading algorithm [1]. Preliminary models, based on the top template structures, are generated using CYANA [2] in order to remove gaps in the structures resulting from unmapped regions of the alignments. To identify persistent structures and topologies within the templates, hierarchical clustering based on pairwise GDT (a structure similarity measure) is performed on the initial CYANA models. The template structures belonging to the three largest clusters of the GDT-based dendrogram are selected for Cα-Cα distance calculation. The resulting matrix of Cα-Cα distances then serves as input to OREO [3-5], an iterative framework for biclustering dense and sparse data matrices via optimal re-ordering of rows and columns.
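The cluster selection and distance calculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linkage rule (single linkage, cut at a fixed GDT cutoff), the cutoff value, the 0-1 GDT scale, the matrix layout (one row per residue pair, one column per selected model), and all function names and toy data are assumptions made for illustration. The biclustering step itself (OREO) is not reproduced here.

```python
import math
from collections import defaultdict


def largest_clusters(gdt, cutoff, k=3):
    """Link models whose pairwise GDT >= cutoff and return the k largest groups.

    Equivalent to cutting a single-linkage dendrogram at `cutoff`
    (the linkage rule is an assumption; the abstract does not specify one).
    """
    n = len(gdt)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if gdt[i][j] >= cutoff:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)[:k]


def ca_distance_rows(models, members):
    """Map each residue pair (i, j) to its C-alpha distance in every selected model."""
    n_res = len(models[members[0]])
    rows = {}
    for i in range(n_res):
        for j in range(i + 1, n_res):
            rows[(i, j)] = [math.dist(models[m][i], models[m][j]) for m in members]
    return rows


# Toy data (hypothetical): 8 models in 4 fold families; GDT is 0.8 within a
# family and 0.2 across families.
family = [0, 0, 0, 1, 1, 2, 2, 3]
gdt = [[1.0 if a == b else (0.8 if fa == fb else 0.2)
        for b, fb in enumerate(family)]
       for a, fa in enumerate(family)]

# Toy C-alpha coordinates (angstroms) for 3 residues per model.
models = [[(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8 + 0.1 * m, 0.0)]
          for m in range(8)]

selected = largest_clusters(gdt, cutoff=0.5)
members = sorted(m for cluster in selected for m in cluster)
rows = ca_distance_rows(models, members)

print(selected)  # [[0, 1, 2], [3, 4], [5, 6]]; singleton model 7 is left out
print(len(rows), "residue pairs x", len(members), "models")
```

In this layout, a bicluster would correspond to a set of residue pairs whose distances agree across a subset of models, which is the kind of block structure a row and column re-ordering method is meant to expose.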

The final step is to filter the clustered distances to exclude those of low confidence. We apply three filters: (i) a variance filter, based on the mean/standard deviation of each distance; (ii) a sequence mapping filter, in which contacts involving poorly mapped positions are excluded, according to an accumulation of the position-specific scoring matrix (PSSM) produced by SPARKS-X; and (iii) a structure-based filter, which removes distance constraints that deviate significantly from the predicted values during CYANA structure generation. The structure-based filter also includes constraints derived from CONCORD [6] predicted secondary structure and predicted beta-sheet topology [7], which are essential for identifying conflicting constraints. The remaining clustered distances are considered strong candidates for conserved contacts and are used to generate Cα distance constraints for a final structure generation.
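As an illustration of filter (i) only, the following Python sketch shows one possible form of a variance filter. The abstract does not give the exact criterion, so the rule used here (keep a residue pair when the ratio of standard deviation to mean across templates is below a threshold, then emit bounds of mean ± one standard deviation) is an assumption, as are the threshold value, the function name, and the toy distances.

```python
from statistics import mean, stdev


def variance_filter(distances, max_cv=0.15, width=1.0):
    """Keep residue pairs whose distance is consistent across templates.

    distances: dict mapping (i, j) -> list of C-alpha distances (angstroms),
               one value per template in the bicluster.
    max_cv:    hypothetical cap on standard deviation / mean.
    width:     hypothetical half-width of the bounds, in standard deviations.
    Returns a dict mapping each kept pair to (lower, upper) distance bounds.
    """
    kept = {}
    for pair, values in distances.items():
        mu = mean(values)
        sigma = stdev(values)  # sample standard deviation; needs >= 2 values
        if mu > 0 and sigma / mu <= max_cv:
            kept[pair] = (mu - width * sigma, mu + width * sigma)
    return kept


# Toy input (hypothetical): one tightly conserved pair, one scattered pair.
distances = {
    (3, 40): [6.1, 6.3, 5.9, 6.2],
    (3, 75): [8.0, 19.5, 12.2, 30.1],
}

kept = variance_filter(distances)
print(sorted(kept))  # [(3, 40)]
```

The sequence mapping and structure-based filters depend on SPARKS-X and CYANA output and are not sketched here.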

We present results on a series of free-modeling targets from the tenth Critical Assessment of techniques for protein Structure Prediction (CASP10) competition. The method outperforms other low-homology template-based contact prediction methods in predicting short-, medium-, and long-range contacts for difficult targets. The contacts are then used to generate structures through a constrained molecular dynamics (MD) run, demonstrating how important such contacts are for accurate structural fold determination.
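For readers unfamiliar with the contact ranges mentioned above, the Python sketch below bins predicted contacts by sequence separation and computes precision per range. It assumes the commonly used CASP convention (short: 6 to 11 residues apart, medium: 12 to 23, long: 24 or more); the abstract does not state which definition or scoring metric was used, and the function names and toy contact sets are hypothetical.

```python
def contact_range(i, j):
    """Classify a residue pair by sequence separation (assumed CASP-style bins)."""
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None  # closer than 6 residues: not scored


def precision_by_range(predicted, native):
    """Fraction of predicted contacts in each range that occur in the native set."""
    hits = {"short": 0, "medium": 0, "long": 0}
    totals = {"short": 0, "medium": 0, "long": 0}
    for i, j in predicted:
        label = contact_range(i, j)
        if label is None:
            continue
        totals[label] += 1
        if (i, j) in native:
            hits[label] += 1
    # Ranges with no predictions are omitted rather than reported as zero.
    return {k: hits[k] / totals[k] for k in totals if totals[k]}


# Toy contact sets (hypothetical), pairs written as (i, j) with i < j.
predicted = {(1, 8), (1, 20), (1, 40), (2, 50), (5, 7)}
native = {(1, 8), (1, 40)}

prec = precision_by_range(predicted, native)
print(prec)
```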

[1] Yang Y, Faraggi E, Zhao H, Zhou Y. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics 27:2076-82 (2011)

[2] López-Méndez B, Güntert P. Automated protein structure determination from NMR spectra. J. Am. Chem. Soc. 128:13112–13122 (2006)

[3] DiMaggio PA, McAllister SR, Floudas CA, Feng XJ, Rabinowitz JD, Rabitz HA. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics, 9:458 (2008)

[4] McAllister SR, DiMaggio PA, Floudas CA. Mathematical modeling and efficient optimization methods for the distance-dependent rearrangement clustering problem. J. Global Optim. 45:111-129 (2009)

[5] DiMaggio PA, McAllister SR, Floudas CA, Feng XJ, Li G, Rabinowitz JD, Rabitz HA. Enhancing molecular discovery using descriptor-free rearrangement clustering techniques for sparse data sets. AIChE J. 56(2):405-418 (2010)

[6] Wei Y, Thompson J, Floudas CA. CONCORD: a consensus method for protein secondary structure prediction via mixed integer linear optimization. Proc. R. Soc. A 468(2139):831-85 (2012)

[7] Subramani A, Floudas CA. β-sheet Topology Prediction with High Precision and Recall for β and Mixed α/β Proteins. PLoS ONE 7(3):e32461 (2012)