(679e) Identifying the Interacting Residues of a Protein Using Machine Learning: a Case-Study on Fluorescent Proteins | AIChE



Dubey, A. - Presenter, School of Chemical and Biomolecular Engineering, Georgia Institute of Technology
Lee, J. H. - Presenter, Korea Advanced Institute of Science and Technology (KAIST)
Realff, M. - Presenter, Georgia Institute of Technology
Bommarius, A. S. - Presenter, Georgia Institute of Technology

The mapping of a protein's sequence to its function is one of the most fundamental challenges in protein engineering and remains an unsolved problem. If solved, it would enable a rational design process in which the structure-to-function mapping is used to screen potential designs. The enormous size of the sequence space (the state space of the problem), defined as all the possible sequences that can be created for an enzyme, adds to the complexity of the problem. However, it is known that only a small fraction of the amino acids in a protein contribute significantly to the protein's properties. Finding these amino acid positions can greatly improve our knowledge of proteins and help us design better experiments to alter them.

Apart from individual amino acids, it is known that in the three-dimensional structure of a protein, certain amino acids interact with each other to maintain structural integrity or to aid in catalytic function [1]. If these positions are mutated, the loss of the interaction usually leads to a non-functional protein. Directed Evolution (DE) experiments [2], which probe the sequence space of a protein through mutation or recombination in search of an improved variant, frequently produce such inactive sequences. In this work, we extend our approach of using machine learning algorithms to find important amino acid residues [3] to interacting pairs. Boolean Learning and Support Vector Machines (SVMs) [4] are used to identify pairs of interacting amino acid residues from the primary sequences of the variants generated during DE. It is shown that this problem can be posed for Boolean Learning by transforming the sequences into Boolean vectors. The logical function that corresponds to the specific problem of finding interacting amino acid residues is formulated in Disjunctive Normal Form (DNF) [5]. A novel approach that combines SVMs with other algorithms, such as Boolean Learning, is also proposed and applied to this problem. SVMs are unique in their ability to quantify structural risk, that is, how well the pattern learnt from the data generalizes. Most other learning algorithms, including Boolean Learning, are based solely on empirical risk, which is their performance on the given data, without considering generalization or performance on data not yet seen.
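
As a minimal sketch of the Boolean encoding described above, assume each variant is a chimera of two aligned parents of equal length, so that every position where the parents differ can be coded as one Boolean variable (True if the residue comes from parent A). Under the further assumption that an interacting pair is disrupted whenever its two residues come from different parents, inactivity can be written as a DNF with two conjunctions per candidate pair. The function names, the toy parent sequences and the pair model itself are illustrative assumptions, not the formulation used in this work.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # indices into the Boolean vector, not residue numbers


def differing_positions(parent_a: str, parent_b: str) -> List[int]:
    """0-based positions at which the two aligned parents differ."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    return [k for k, (a, b) in enumerate(zip(parent_a, parent_b)) if a != b]


def to_boolean(variant: str, parent_a: str, positions: Sequence[int]) -> List[bool]:
    """One Boolean per differing position: True if the residue matches parent A."""
    return [variant[k] == parent_a[k] for k in positions]


def is_inactive(bits: Sequence[bool], pairs: Sequence[Pair]) -> bool:
    """DNF over pairs: (x_i AND NOT x_j) OR (NOT x_i AND x_j) for any pair."""
    return any(
        (bits[i] and not bits[j]) or (not bits[i] and bits[j])
        for i, j in pairs
    )


# Toy, hypothetical sequences (not mRFP or dsRED).
parent_a = "MKTAYLV"
parent_b = "MRTSYIV"
variant = "MKTSYLV"

positions = differing_positions(parent_a, parent_b)
bits = to_boolean(variant, parent_a, positions)
print(positions, bits)
print(is_inactive(bits, [(0, 1)]))  # the pair is split between the two parents
print(is_inactive(bits, [(0, 2)]))  # both residues come from parent A
```

In this encoding the learning task reduces to finding which index pairs must appear in the DNF so that it agrees with the observed active and inactive variants.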

An extension of this approach to multiple rounds of evolution is proposed. The results obtained from one round can be used to design parent sequences for the subsequent rounds. The purpose of this strategy is to increase the average number of variants that retain catalytic activity as well as the average number of mutations obtained. Moreover, by using the variant sequences generated over multiple rounds, the identification of interacting pairs can also be improved when compared with using all sequences generated from the same round.

Simulations show that the pairs can be identified with reasonable accuracy, which declines as the number of pairs per sequence and the length of the sequence increase. The combined approach with SVMs generally gives better results than Boolean Learning alone. The results from the multi-round strategy show that large improvements in the activity and mutation levels of the libraries can be obtained. The identification of the interacting pairs is also significantly improved. The combined strategy again performs better, which justifies exploiting the structural-risk basis of SVMs even for an identification problem like this one, which by its nature is well suited to Boolean Learning.
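
A simulation of this kind can be sketched as follows, under strong simplifying assumptions: each variant is a random Boolean vector, a variant is active only if the two positions of every hidden pair carry the same value, and pairs are recovered by simple elimination (a candidate pair is discarded as soon as one active variant splits it). This elimination rule is a stand-in chosen for brevity; it is neither the branch-and-bound Boolean learner of [5] nor the combined SVM approach, and all names and parameters are hypothetical.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]
Library = List[Tuple[List[bool], bool]]  # (Boolean vector, is_active)


def is_active(bits: Sequence[bool], pairs: Sequence[Pair]) -> bool:
    """Active only if no hidden pair is split across the two parents."""
    return all(bits[i] == bits[j] for i, j in pairs)


def simulate_library(n_positions: int, pairs: Sequence[Pair],
                     n_variants: int, seed: int = 0) -> Library:
    """Draw random Boolean variants and label them with the hidden pairs."""
    rng = random.Random(seed)
    library = []
    for _ in range(n_variants):
        bits = [rng.random() < 0.5 for _ in range(n_positions)]
        library.append((bits, is_active(bits, pairs)))
    return library


def candidate_pairs(library: Library, n_positions: int) -> List[Pair]:
    """Keep every pair that no active variant splits."""
    active = [bits for bits, ok in library if ok]
    return [
        (i, j)
        for i, j in itertools.combinations(range(n_positions), 2)
        if all(bits[i] == bits[j] for bits in active)
    ]


hidden = [(0, 3), (2, 5)]
library = simulate_library(n_positions=6, pairs=hidden, n_variants=200, seed=1)
print(candidate_pairs(library, n_positions=6))
```

In this toy model the true pairs always survive elimination, while spurious pairs survive only if too few active variants were sampled to split them; adding hidden pairs lowers the fraction of active variants, and adding positions raises the number of candidates to eliminate.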

To verify the strength of this approach, sequences obtained by recombining mRFP and dsRED using both RDA-PCR [6] and DNA shuffling [7] were used to identify interactions between residues in their sequences. The algorithm indicated an interaction between the amino acids at positions 197 and 217. To verify this interaction, point mutations were made in both the mRFP and dsRED sequences to exchange the amino acids at these two positions for the corresponding amino acids of the other sequence. The results show that a mutation at position 217 is tolerated in dsRED but not in mRFP, whereas position 197 is intolerant to mutation in both proteins. However, a double mutant of dsRED, with the amino acids at both 197 and 217 changed to the corresponding amino acids of mRFP, was active, thus confirming the interaction between them.
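
The construction of the single and double mutants can be illustrated in silico. The sketch below assumes two aligned sequences of equal length and 1-based residue numbering; the sequences are short hypothetical strings, and positions 2 and 4 stand in for 197 and 217, since the actual mRFP and dsRED sequences are not reproduced here.

```python
from typing import Iterable


def swap_positions(target: str, donor: str, positions: Iterable[int]) -> str:
    """Copy the donor residue into the target at each 1-based position."""
    if len(target) != len(donor):
        raise ValueError("sequences must be aligned to the same length")
    residues = list(target)
    for pos in positions:
        residues[pos - 1] = donor[pos - 1]
    return "".join(residues)


# Hypothetical toy sequences, not the real proteins.
target = "MASSEDV"
donor = "MVSKGEE"

single_first = swap_positions(target, donor, [2])
single_second = swap_positions(target, donor, [4])
double = swap_positions(target, donor, [2, 4])
print(single_first, single_second, double)
```

If the two positions interact, the expectation is that one or both single mutants lose activity while the double mutant, which restores a matched pair, can remain active.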

To confirm the possible improvement in subsequent rounds of recombination resulting from this finding, DNA shuffling was performed between native mRFP and the active double mutant of dsRED. The resulting library was compared with the library obtained by shuffling native mRFP and native dsRED.


1. Meyer, M.M., et al., Library analysis of SCHEMA-guided protein recombination. Protein Science, 2003. 12(8): p. 1686-1693.

2. Petrounia, I.P. and F.H. Arnold, Designed evolution of enzymatic properties. Current Opinion in Biotechnology, 2000. 11(4): p. 325-330.

3. Dubey, A., et al., Support vector machines for learning to identify the critical positions of a protein. Journal of Theoretical Biology, 2005. 234(3): p. 351-361.

4. Schölkopf, B. and A.J. Smola, Learning with Kernels. 2002, Cambridge, MA: MIT Press.

5. Triantaphyllou, E., Inference of a Minimum Size Boolean Function from Examples by Using a New Efficient Branch-and-Bound Approach. Journal of Global Optimization, 1994. 5(1): p. 69-94.

6. Ikeuchi, A., et al., Chimeric Gene Library Construction by a Simple and Highly Versatile Method Using Recombination-Dependent Exponential Amplification. Biotechnology Progress, 2003. 19(5): p. 1460-1467.

7. Stemmer, W.P., Rapid evolution of a protein in vitro by DNA shuffling. Nature, 1994. 370(6488): p. 389-391.