(230b) A Novel Framework for Beta Sheet Topology Prediction in Purely Beta and Mixed Alpha-Beta Proteins

Subramani, A., Princeton University
Wei, Y., Princeton University
Floudas, C. A., Princeton University

We present a novel method for the prediction of beta sheet topology in purely beta and mixed alpha-beta proteins. Predicting the topology is an important intermediate step in determining the three-dimensional structure of a protein [1,2,3]. To this end, we propose a model that maximizes the hydrophobic interactions which drive the formation of beta sheets in proteins. The model is formulated as an integer linear optimization problem, with binary variables representing contacts between beta strands and between specific hydrophobic amino acids in the beta strands. Constraints are imposed so that only biologically meaningful topologies are produced. These include constraints that disallow cross-linking in beta strands and constraints that allow at most two contacts per beta strand [4]. In addition, chemical shift based information is used to improve specificity in the identification of the native fold. The chemical shift of an atom reflects its local environment, and the chemical shifts of specific atoms therefore correlate well with membership in secondary structure elements. We use the extended SPARTA [5] database to collect chemical shift information on the triplets of residues present in beta strands. This data is clustered using the traveling salesman implementation of a novel clustering method, OREO [6]. The propensity of a tripeptide to form one or two contacts, which indicates whether it belongs to a terminal or a central strand of a beta sheet, respectively, is evaluated by taking a consensus over the cluster to which it belongs. We have also collected information on triplets of residues that occur in beta strands in the latest PDBSelect25 database, by constructing a contact map of the frequency with which each tripeptide contacts every other tripeptide. This data is also used to constrain the model towards the native fold.
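As a rough illustration of the strand-level part of such a model, the sketch below chooses a binary variable for every pair of strands, maximizes a summed contact score, and enforces the limit of two contacts per strand. It is only a toy under stated assumptions: the function name `best_topology`, the score values, and the use of exhaustive enumeration in place of an integer linear programming solver are all hypothetical, and the cross-linking constraints, residue-level contacts, and chemical shift information of the actual model are not represented.

```python
from itertools import combinations


def best_topology(n_strands, score, max_partners=2):
    """Return (best objective value, chosen strand pairs).

    score maps a strand pair (i, j), i < j, to a hypothetical
    hydrophobic contact score; missing pairs score 0.
    """
    pairs = list(combinations(range(n_strands), 2))
    best_value, best_pairs = float("-inf"), ()
    # Each bit of mask plays the role of one binary strand-pair variable.
    for mask in range(1 << len(pairs)):
        chosen = [p for k, p in enumerate(pairs) if (mask >> k) & 1]
        # Constraint: no strand may pair with more than max_partners strands.
        degree = [0] * n_strands
        for i, j in chosen:
            degree[i] += 1
            degree[j] += 1
        if max(degree) > max_partners:
            continue
        value = sum(score.get(p, 0.0) for p in chosen)
        if value > best_value:
            best_value, best_pairs = value, tuple(chosen)
    return best_value, best_pairs


# Hypothetical scores for a four-strand example (illustrative values only).
scores = {(0, 1): 3.0, (1, 2): 2.5, (2, 3): 2.0,
          (0, 2): 1.0, (1, 3): 0.5, (0, 3): -1.0}
value, topology = best_topology(4, scores)
print(value, topology)
```

Enumeration is practical only for a handful of strands, since the number of candidate pairings grows exponentially with the number of strand pairs; a solver-based formulation avoids visiting every assignment explicitly.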
In a five-strand protein, for example, the native structure may place all of the strands in a single sheet or divide them between two sheets. To reduce the computational complexity and to improve the topology prediction, we include constraints derived from a prior prediction of the number of beta sheets that will be present, given the set of beta strands. We address this problem using a support vector machine based prediction approach [7]. The approach identifies loops linking beta strands that are likely to cause a break in a beta sheet. The input vector to the support vector machines includes physico-chemical properties of amino acids, along with position specific scoring matrices, derived from BLAST, for specific residues of the loop. The entire set of constraints is implemented as an Integer Linear Programming (ILP) formulation. The primary advantage of an ILP formulation is the ability to create a rank-ordered list of sheet topologies by including integer cut constraints. At every iteration, these constraints eliminate the best solutions found so far from the pool of possible solutions, and the model is solved again to generate the next best solution.
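The rank-ordering idea can be sketched on a toy binary problem. For a previously found solution s, the standard integer cut requires that the sum of y_i over positions where s_i = 1, minus the sum of y_i over positions where s_i = 0, be at most (number of ones in s) - 1, which excludes s and nothing else. The code below is a minimal sketch assuming a brute-force search in place of an ILP solver; the names `solve` and `ranked_solutions`, the weights, and the feasibility rule are hypothetical.

```python
from itertools import product


def violates_cut(y, s):
    """True if binary vector y is excluded by the integer cut built from s."""
    lhs = sum(yi if si else -yi for yi, si in zip(y, s))
    return lhs > sum(s) - 1


def solve(weights, feasible, cuts):
    """Maximize sum(w_i * y_i) over binary y that is feasible and passes all cuts."""
    best = None
    for y in product((0, 1), repeat=len(weights)):
        if not feasible(y) or any(violates_cut(y, s) for s in cuts):
            continue
        value = sum(w * yi for w, yi in zip(weights, y))
        if best is None or value > best[0]:
            best = (value, y)
    return best


def ranked_solutions(weights, feasible, k):
    """Return up to k solutions in decreasing objective order."""
    cuts, ranked = [], []
    for _ in range(k):
        result = solve(weights, feasible, cuts)
        if result is None:  # every feasible solution has been cut off
            break
        ranked.append(result)
        cuts.append(result[1])  # exclude this solution in later iterations
    return ranked


# Hypothetical toy instance: pick at most two of three contacts.
ranking = ranked_solutions([3.0, 2.5, 2.0], lambda y: sum(y) <= 2, k=3)
for value, y in ranking:
    print(value, y)
```

Each pass adds one cut, so the k-th solve returns the best solution among those not yet reported, which is what produces the ranked list.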


[1] McAllister SR and Floudas CA (2010) An improved hybrid global optimization method for protein tertiary structure prediction, Comput. Optim. Appl., 45, 377-413

[2] Floudas CA, Fung HK, McAllister SR, Monnigmann M and Rajgaria R (2006) Advances in Protein Structure Prediction and De Novo Protein Design: A Review, Chem. Eng. Sci., 61, 966-988

[3] Floudas CA (2007) Computational methods in protein structure prediction, Biotech. Bioeng., 97, 207-213

[4] Klepeis JL and Floudas CA (2003) Prediction of beta-sheet topology and disulfide bridges in polypeptides, J. Comput. Chem., 24, 191-208

[5] Shen Y and Bax A (2007) Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology, J. Biomol. NMR, 38, 289-302

[6] DiMaggio PA, McAllister SR and Floudas CA (2008) Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies, BMC Bioinf., 9, 458

[7] Fan RE, Chen PH and Lin CJ (2005) Working set selection using second order information for training SVM, J. Mach. Learn. Res., 6, 1889-1918