(117c) Rank-Ordering of Known Enzymes As Starting Points for Re-Engineering Novel Substrate Activity Using a Convolutional Neural Network | AIChE



Boorla, V. S., Pennsylvania State University
Retrosynthetic approaches have made significant advances in predicting synthesis routes for target biofuel, bio-renewable, or bio-active molecules. Relying only on cataloged enzymatic activities, however, limits the discovery of new production routes. Recent retrosynthetic algorithms therefore increasingly use novel conversions that require altering the substrate or cofactor specificity of an existing enzyme to complete the pathway. Although the suggestion of such novel conversions is limited only by the computational expanse of the search algorithm, identifying and re-engineering the enzymes needed to complete the pathway is currently the bottleneck in implementing these designs. Here we present EnzRank, a convolutional neural network (CNN) approach that rank-orders existing enzymes by their suitability for successful protein engineering, through directed evolution or de novo design, towards the desired specific activity. EnzRank was inspired by the recent rapid progress of machine learning-based approaches in predicting enzyme classes, properties, and functions, with tools such as AlphaFold2, DeepEC, and SDN2GO. EnzRank requires encoding both the similarity of the native vs. novel substrate(s) and the plasticity of the targeted enzyme to assign a probability score that a given enzyme has activity on a given substrate. The probability scores that EnzRank calculates for enzyme-substrate pairs can then be used to rank-order all candidate enzymes by their potential to exhibit activity (even residual) on the novel substrate. Accounting for substrate and enzyme information simultaneously in this way yields a rank-ordered list from which to select a starting enzyme for any de novo reaction step in a pathway design, which is our primary goal.
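The rank-ordering step described above can be illustrated with a minimal Python sketch. It assumes a probability score has already been computed for each candidate enzyme against one novel substrate; the function name `rank_enzymes`, the enzyme labels, and the score values are hypothetical placeholders and are not taken from EnzRank.

```python
from typing import Dict, List, Tuple


def rank_enzymes(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate enzymes by descending predicted activity probability.

    `scores` maps an enzyme identifier to the probability that the enzyme
    is active on the novel substrate. Ties are broken alphabetically so
    the ordering is deterministic.
    """
    for name, p in scores.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score for {name!r} is not a probability: {p}")
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for three candidate enzymes (illustrative values only).
example_scores = {"enzyme_A": 0.42, "enzyme_B": 0.91, "enzyme_C": 0.08}
ranking = rank_enzymes(example_scores)

for position, (enzyme, probability) in enumerate(ranking, start=1):
    print(f"{position}. {enzyme}: {probability:.2f}")
```

The top entry of `ranking` would be the suggested starting point for re-engineering; lower-ranked enzymes remain available as fallbacks.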
It is important to note that even though numerous algorithms predict EC classifications from an enzyme sequence, this level of detail is not sufficient for our goal, as all candidate enzymes would presumably be assigned the same EC number. The varying length of protein sequences makes it challenging to describe protein features for machine learning models. Moreover, only certain parts of the protein, such as specific residues, are involved in enzyme-substrate activity rather than the complete protein structure. Hence, the physicochemical properties of the entire protein sequence do not seem to be an appropriate feature for predicting activity, because the portions of the sequence that are not involved in enzyme-substrate activity contribute noise. Thus, extracting the local residue patterns involved in the actual enzyme-substrate interaction is necessary to make an accurate prediction. We obtained known active enzyme-substrate pairs from the BRENDA database. From these data we also generated scrambled negative pairs, using the Tanimoto chemical similarity index to pair enzymes with substrates that are entirely dissimilar to their native substrates. In total, we created 11,080 positive enzyme-substrate pairs and 11,076 generated negative pairs. Next, we used the holdout method to train and cross-validate the CNN (using an 80:10:10 split into training, validation, and testing data) on ten random data splits. EnzRank achieves an average recovery rate of 81.6% and 74.9% for positive and negative pairs, respectively, and performs similarly on the test data split, with 81.5% positive and 73.8% negative recovery. We demonstrate a graphical user interface (which will be made publicly available at https://github.com/maranasgroup/EnzRank) that predicts enzyme-substrate activity from customized user input of novel substrates as SMILES strings and enzyme sequence information.
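As an illustration of the negative-pair generation described above, the following Python sketch computes the Tanimoto index on fingerprint bit sets and pairs each enzyme with substrates that are dissimilar to its native substrate. It is a sketch under stated assumptions, not the authors' procedure: the fingerprints are toy sets of "on" bit positions, and the cutoff `max_similarity` and all identifiers are hypothetical, since the abstract does not state the fingerprint type or the threshold used.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto index |A ∩ B| / |A ∪ B| for two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def negative_pairs(
    native: Dict[str, str],
    fingerprints: Dict[str, FrozenSet[int]],
    max_similarity: float = 0.2,  # hypothetical cutoff, not from the abstract
) -> List[Tuple[str, str]]:
    """Pair each enzyme with substrates dissimilar to its native substrate."""
    negatives = []
    for enzyme, native_substrate in sorted(native.items()):
        for substrate in sorted(fingerprints):
            if substrate == native_substrate:
                continue
            similarity = tanimoto(
                fingerprints[native_substrate], fingerprints[substrate]
            )
            if similarity <= max_similarity:
                negatives.append((enzyme, substrate))
    return negatives


# Toy fingerprints (sets of bit positions); real ones would come from a
# cheminformatics toolkit applied to the substrates' SMILES strings.
toy_fingerprints = {
    "sub_1": frozenset({1, 2, 3, 4}),
    "sub_2": frozenset({1, 2, 3, 5}),
    "sub_3": frozenset({7, 8, 9}),
}
toy_native = {"enz_X": "sub_1"}

pairs = negative_pairs(toy_native, toy_fingerprints)
print(pairs)
```

In this toy example, `sub_2` shares three of five distinct bits with the native substrate and is therefore too similar to serve as a negative, while `sub_3` shares none and is selected.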