(272b) Machine Learning Predicts Functional Classes of Family 7 Glycoside Hydrolases with High Accuracy | AIChE


Gado, J. - Presenter, University of Kentucky
Payne, C. M., National Renewable Energy Laboratory
Ståhlberg, J., Swedish University of Agricultural Sciences
Borisova, A., Swedish University of Agricultural Sciences
Glycoside hydrolases (GHs) are a class of enzymes that catalyze the hydrolysis of glycosidic bonds in saccharides. They are used in industries such as biofuels and textiles for the enzymatic degradation and reorganization of saccharides. GHs are presently classified into 152 families based on sequence identity. Family 7 glycoside hydrolases (GH7s) are found predominantly in fungi and are often the largest component, by mass, of the secretomes of cellulolytic fungi. In the biofuel industry, GH7s are the primary components of the enzymatic cocktails used in cellulose degradation. GH7s fall into one of two classes: cellobiohydrolases (CBHs) or endoglucanases (EGs). GH7 CBHs hydrolyze cellulose processively, i.e., they carry out multiple catalytic steps without dissociating from the substrate. GH7 EGs, on the other hand, are non-processive and dissociate from the substrate after hydrolyzing a single glycosidic bond. Processive GH7s (CBHs) have become a focus of research because they provide the greatest hydrolytic potential in enzymatic cellulose degradation. Because many of the known GH7 sequences have not yet been classified by activity, we set out to develop a predictive approach for classifying GH7 activity. We first retrieved a large and diverse set of 1,521 GH7 sequences from genomic databases. The functional class (i.e., CBH or EG) is reported for only about 30% of these GH7s. We trained multiple machine learning classifiers (decision tree, SVM, naïve Bayes, and logistic regression) using known structural differences between GH7 CBHs and EGs as features. Using Monte Carlo cross-validation, we determined that the overall accuracy of the machine learning classifiers ranges from 95 to 97%, suggesting that the functional class of a GH7 can be readily predicted from its sequence alone.
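To make the evaluation procedure concrete, the following is a minimal sketch of Monte Carlo cross-validation, i.e., averaging test accuracy over many random train/test splits, applied to a hand-written Gaussian naïve Bayes classifier. It is not the authors' implementation. The two numeric features, the synthetic data produced by `make_toy_data`, the number of splits, and the 20% test fraction are all assumptions chosen for illustration; the abstract does not state the actual features or split settings.

```python
import math
import random
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Fit Gaussian naive Bayes: per-class prior, feature means and variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(c) for c in cols],
            # Small floor keeps every variance strictly positive.
            "vars": [pvariance(c) + 1e-9 for c in cols],
        }
    return model


def predict_gaussian_nb(model, x):
    """Return the label with the highest log-posterior for feature vector x."""
    def log_posterior(params):
        total = math.log(params["prior"])
        for v, m, s2 in zip(x, params["means"], params["vars"]):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def monte_carlo_cv(X, y, n_splits=100, test_fraction=0.2, seed=0):
    """Mean test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, int(len(X) * test_fraction))
    accuracies = []
    for _ in range(n_splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)  # a fresh random partition on every repetition
        test, train = idx[:n_test], idx[n_test:]
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict_gaussian_nb(model, X[i]) == y[i] for i in test)
        accuracies.append(hits / n_test)
    return mean(accuracies)


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic two-feature data (hypothetical, NOT real GH7 measurements)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in (("CBH", 10.0), ("EG", 4.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.5), rng.gauss(centre / 2, 1.0)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"Mean accuracy: {monte_carlo_cv(X, y):.3f}")
```

Unlike k-fold cross-validation, the random splits here are drawn independently, so a given sequence may appear in the test set of several repetitions or of none. Any of the other classifier types named above could be evaluated the same way by swapping out the fit and predict functions.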